Comment by anthonyhn

2 years ago

For those not using Cloudflare who have access to their web server config files and want to block AI bots, I put together a set of prebuilt configs[0] (for Apache, Nginx, Lighttpd, and Caddy) that block most AI bots from scraping content. The configs are built on public data sources[1], with various adjustments.

[0] https://github.com/anthmn/ai-bot-blocker

[1] https://darkvisitors.com/
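
For anyone curious what this kind of blocking amounts to, here is a rough Python sketch of the idea: match the request's User-Agent header against a list of known bot tokens and refuse the request on a hit. The token list, `is_blocked`, and `status_for` are illustrative assumptions for this sketch, not the contents of the repo or its actual rules.

```python
import re

# Illustrative user-agent tokens only; this is NOT the list shipped in the linked repo.
BLOCKED_TOKENS = ["GPTBot", "ClaudeBot", "Bytespider"]

# One case-insensitive pattern built from the escaped tokens.
BLOCKED_RE = re.compile("|".join(re.escape(t) for t in BLOCKED_TOKENS), re.IGNORECASE)


def is_blocked(user_agent):
    """Return True if the User-Agent string contains any blocked token."""
    return bool(BLOCKED_RE.search(user_agent or ""))


def status_for(user_agent):
    """Hypothetical helper: the HTTP status a server might answer with."""
    return 403 if is_blocked(user_agent) else 200
```

A server config does the same thing declaratively: a substring or regex test on the User-Agent header, followed by a deny. Note that this only stops bots that identify themselves honestly.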

3 comments


ccgreg 2 years ago

I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs and most of the 10,000 research papers using our corpus have nothing to do with Machine Learning.

Edit: Publishers Target Common Crawl In Fight Over AI Training Data https://www.wired.com/story/the-fight-against-ai-comes-to-a-...

anthonyhn 2 years ago
First off, I want to thank you and the other members of the CC Foundation. The CC data set is an incredible resource for everyone.
Much of the UA data, including CCBot, comes from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these services are not AI scraping bots. I've now excluded CCBot[1] and the archival services from the recommended configs.
[0] https://darkvisitors.com/agents/ccbot
[1] https://github.com/anthmn/ai-bot-blocker/commit/ae0c2c40fd08...
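
A minimal sketch of how such an exclusion could work, assuming the upstream data is just a list of user-agent tokens. The names `apply_exclusions`, `upstream`, and `recommended`, and the sample tokens, are hypothetical and not taken from the commit.

```python
def apply_exclusions(tokens, excluded):
    """Return tokens with the excluded ones removed, compared case-insensitively."""
    excluded_lower = {name.lower() for name in excluded}
    return [t for t in tokens if t.lower() not in excluded_lower]


# Hypothetical upstream list mixing AI crawlers with archival crawlers.
upstream = ["GPTBot", "CCBot", "ClaudeBot", "ia_archiver"]

# Keep archival crawlers out of the recommended block list.
recommended = apply_exclusions(upstream, ["ccbot", "ia_archiver"])
```

Keeping the exclusions as a separate step means the upstream list can be refreshed without losing the decision to leave archival crawlers alone.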
ccgreg 2 years ago

Thank you!