Comment by bogwog

6 months ago

I wonder if the best solution is still just to create link mazes with garbage text like this: https://blog.cloudflare.com/ai-labyrinth/

It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big-name company and force them to reassess their crawling strategy going forward?

3 comments

ronsor 6 months ago

That won't work, because garbage data is filtered after the full dataset is collected anyway. Every LLM trainer these days knows that curation is key.

bogwog 6 months ago

If the "garbage data" is AI generated, it'll be hard or impossible to filter.

creatonez 6 months ago

Crawlers already know how to stop crawling recursive or otherwise excessive/suspicious content. They were dealing with this problem long before LLM-related crawling.
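For illustration only, here is a rough sketch of the kind of guard being described: a depth cap plus a per-host page budget. It is not taken from any real crawler. `crawl`, `get_links`, `maze_links`, and the limits are all made-up names and values, and the "site" is a toy in-memory link maze rather than real HTTP.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=3, max_pages_per_host=50):
    """Breadth-first crawl that gives up on deep or oversized sites.

    get_links is a hypothetical callable: URL -> list of outgoing URLs.
    """
    seen = {start_url}
    pages_per_host = {}
    visited = []
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        host = urlparse(url).netloc
        # Per-host budget: an endless maze costs at most max_pages_per_host fetches.
        if pages_per_host.get(host, 0) >= max_pages_per_host:
            continue
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        visited.append(url)

        # Depth cap: do not follow links out of pages that are already deep.
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def maze_links(url):
    # Toy link maze (assumption): every page links to two brand-new pages, forever.
    return [url + "/a", url + "/b"]


pages = crawl("https://maze.example", maze_links, max_depth=10, max_pages_per_host=20)
print(len(pages))
```

Even though the maze never ends, the crawl stops once either limit is hit, so the generated pages only cost the crawler a bounded number of requests. Real crawlers presumably layer more heuristics on top (duplicate-content detection, URL pattern checks), which this sketch does not attempt.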