Comment by username223

9 days ago

I'm part of that small but (hopefully) growing percentage, because Common Crawl is a deeply dishonest front for AI data scraping. Quoting Wikipedia:

""" In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed it respected paywalls in its scraping and requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. """

My site is CC-BY-NC-SA, i.e. non-commercial and with attribution, and Common Crawl took a dubious position on whether fair use makes that irrelevant. They can burn.

ccgreg 9 days ago

Did you see our reply? https://commoncrawl.org/blog/setting-the-record-straight-com...

Also, if your site has CC-BY-NC-SA markings, we have preserved them.

username223 9 days ago

Hopefully my site is no longer part of Common Crawl. I'm not interested in participating in your project: I block CCBot in robots.txt and have requested deletion of my data via your form.
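
For readers wondering what blocking CCBot in robots.txt amounts to, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt contents, the `is_blocked` helper, and the example.org URL are all hypothetical illustrations, not taken from the commenter's site. Note that this only checks what the rules say; whether a given crawler honors them is a separate question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow CCBot everywhere, allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules disallow `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.org is a placeholder domain used only for illustration.
    print(is_blocked(ROBOTS_TXT, "CCBot", "https://example.org/post"))
    print(is_blocked(ROBOTS_TXT, "SomeOtherBot", "https://example.org/post"))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` would fetch the real file before calling `can_fetch()`.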
- ccgreg 9 days ago
  
  Did you see our reply? Edit: by which I mean, we sent you an email that explains what we did and how to verify it. Did you not receive an email reply? If not, please contact us again.
  Also, if your site has CC-BY-NC-SA markings, we have preserved them.
  

ccgreg 9 days ago

Oh, and thanks for letting me know that I need to add our reply to Wikipedia.

samtheDamned 8 days ago

From my basic experience editing Wikipedia I'm not sure you should edit the page of your own project. Maybe add a discussion for it instead? Or perhaps I'm mistaken.