
Comment by anthonyhn

1 year ago

For those not using Cloudflare who have access to their web server config files and want to block AI bots, I put together a set of prebuilt configs[0] (for Apache, Nginx, Lighttpd, and Caddy) that block most AI bots from scraping content. The configs are built on top of public data sources[1], with various adjustments.

[0] https://github.com/anthmn/ai-bot-blocker

[1] https://darkvisitors.com/
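
For illustration only: configs like these generally work by matching the request's User-Agent header against a list of known bot names and refusing matches. The Python sketch below shows that idea and is not taken from the linked repo. The short `BLOCKED_AGENTS` list and the `is_blocked` and `status_for` helpers are hypothetical names chosen for this example, and a real blocklist would be much longer.

```python
import re
from typing import Optional

# Hypothetical, abbreviated blocklist for illustration; not the repo's actual list.
BLOCKED_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider"]

# One case-insensitive pattern that matches any listed name anywhere in the header.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(name) for name in BLOCKED_AGENTS),
    re.IGNORECASE,
)


def is_blocked(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a blocklisted bot name."""
    if not user_agent:
        return False
    return _BLOCKED_RE.search(user_agent) is not None


def status_for(user_agent: Optional[str]) -> int:
    """Return the HTTP status a server would send: 403 for blocked bots, else 200."""
    return 403 if is_blocked(user_agent) else 200
```

Note that this only stops bots that identify themselves honestly; a crawler that sends a browser-like User-Agent passes straight through.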

I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs, and most of the 10,000 research papers using our corpus have nothing to do with machine learning.

Edit: Publishers Target Common Crawl In Fight Over AI Training Data https://www.wired.com/story/the-fight-against-ai-comes-to-a-...