Comment by ccgreg

1 year ago

I wonder why you're blocking CCBot? The Common Crawl Foundation predates LLMs, and most of the 10,000 research papers using our corpus have nothing to do with machine learning.

Edit: Publishers Target Common Crawl In Fight Over AI Training Data https://www.wired.com/story/the-fight-against-ai-comes-to-a-...

First off, I want to thank you and the other members of the CC Foundation. The CC data set is an incredible resource for everyone.

Much of the user-agent (UA) data, including CCBot, comes from an upstream source[0]. I was torn on whether CCBot and other archival bots should be included in the configs, since these services are not AI scrapers. I've now excluded CCBot[1] and the archival services from the recommended configs.

[0] https://darkvisitors.com/agents/ccbot

[1] https://github.com/anthmn/ai-bot-blocker/commit/ae0c2c40fd08...