Comment by anthonyhn
1 year ago
For those not using Cloudflare but who have access to web server config files and want to block AI bots, I put together a set of prebuilt configs[0] (for Apache, Nginx, Lighttpd, and Caddy) that will block most AI bots from scraping content. The configs are built on top of public data sources[1] with various adjustments.
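For anyone curious what this kind of blocking amounts to, here is a rough Python sketch of user-agent matching. It is not taken from the configs themselves: the token list is a small illustrative sample, and `AI_BOT_TOKENS` and `is_blocked` are hypothetical names chosen for this example.

```python
import re

# Illustrative sample of AI crawler user-agent tokens (an assumption for this
# sketch; a real list would come from a maintained public data source).
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "Bytespider"]

# One case-insensitive alternation over all tokens, escaped so each token is
# matched literally.
BLOCK_PATTERN = re.compile(
    "|".join(re.escape(token) for token in AI_BOT_TOKENS),
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a listed bot token."""
    return bool(BLOCK_PATTERN.search(user_agent or ""))
```

A web server config does essentially the same substring or regex test against the `User-Agent` header and returns an error status on a match. This only stops bots that identify themselves honestly.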
I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs and most of the 10,000 research papers using our corpus have nothing to do with Machine Learning.
Edit: Publishers Target Common Crawl In Fight Over AI Training Data https://www.wired.com/story/the-fight-against-ai-comes-to-a-...
First off, I want to thank you and the other members of the CC Foundation. The CC data set is an incredible resource for everyone.
Much of the UA data, including CCBot, is from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these are not AI scraping services. I've now excluded CCBot[1] and the archival services from the recommended configs.
[0] https://darkvisitors.com/agents/ccbot
[1] https://github.com/anthmn/ai-bot-blocker/commit/ae0c2c40fd08...
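As a rough illustration of the exclusion described above, the sketch below filters an upstream token list before emitting an Nginx-style rule. It is not the code from the linked commit: `UPSTREAM_TOKENS`, `EXCLUDED_TOKENS`, and `build_nginx_rule` are hypothetical names, and the token lists are made-up samples.

```python
import re

# Hypothetical sample of tokens as they might arrive from an upstream list.
UPSTREAM_TOKENS = ["GPTBot", "CCBot", "ClaudeBot", "ia_archiver"]

# Archival crawlers deliberately kept out of the block list (assumed sample).
EXCLUDED_TOKENS = {"CCBot", "ia_archiver"}


def build_nginx_rule(tokens, excluded):
    """Build an Nginx `if` rule that returns 403 for non-excluded tokens."""
    kept = [token for token in tokens if token not in excluded]
    if not kept:
        # An empty alternation would match every request, so emit nothing.
        return ""
    # re.escape follows Python regex rules; this is assumed to be safe for
    # the simple alphanumeric tokens used here.
    pattern = "|".join(re.escape(token) for token in kept)
    return 'if ($http_user_agent ~* "(%s)") { return 403; }' % pattern


print(build_nginx_rule(UPSTREAM_TOKENS, EXCLUDED_TOKENS))
```

Keeping the exclusions in a separate set means the upstream list can be refreshed without re-adding the archival bots by accident.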
Thank you!