Comment by ccgreg

9 days ago

That 20% number is for a limited list of relatively large news websites. If you include the long tail of news, the percentage of sites blocking is much smaller.

I'm part of that small but (hopefully) growing percentage, because Common Crawl is a deeply dishonest front for AI data scraping. Quoting Wikipedia:

""" In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed it respected paywalls in its scraping and requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. """

My site is CC-BY-NC-SA, i.e. non-commercial and with attribution, and Common Crawl took a dubious position on whether fair use makes that irrelevant. They can burn.