Comment by andrethegiant
19 days ago
Common Crawl is supposed to help with this, i.e. crawl once and host the dataset for any interested party to download out of band. However, the data can be up to a month stale, and it costs $$ to move it out of us-east-1.
I’m working on a centralized crawling platform[1] that aims to mitigate OP’s problem. A caching layer with a ~24h TTL for unauthed content would shield websites from redundant bot traffic while still providing up-to-date content to AI crawlers.
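A minimal sketch of that idea, assuming a shared cache keyed by URL; the names and structure here are illustrative, not the platform's actual API:

```python
import time

TTL_SECONDS = 24 * 60 * 60                   # ~24h freshness window for unauthed content
_cache: dict[str, tuple[float, bytes]] = {}  # url -> (fetched_at, body)

def fetch_for_crawler(url: str, origin_fetch) -> bytes:
    """Serve a cached copy if it's fresh; otherwise hit the origin once and cache it."""
    now = time.time()
    entry = _cache.get(url)
    if entry and now - entry[0] < TTL_SECONDS:
        return entry[1]                      # shields the origin from repeat bot traffic
    body = origin_fetch(url)                 # single upstream request shared by all crawlers
    _cache[url] = (now, body)
    return body
```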
You can download Common Crawl data for free over HTTPS with no credentials. If you don't store it (streamed processing or equivalent) and you aren't charged for incoming data (most clouds don't charge for ingress), you're good!
You can do so by prefixing each WARC/WAT/WET path with `https://data.commoncrawl.org/` instead of `s3://commoncrawl/`.
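A minimal sketch of streaming one file that way in Python; the path below is a placeholder (real paths come from the crawl's warc/wat/wet paths listings):

```python
import gzip
import requests

BASE = "https://data.commoncrawl.org/"                # instead of s3://commoncrawl/
path = "crawl-data/CC-MAIN-.../wet/....warc.wet.gz"   # placeholder; take it from wet.paths.gz

with requests.get(BASE + path, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # Decompress on the fly so nothing has to be written to disk.
    with gzip.open(resp.raw, "rt", encoding="utf-8", errors="replace") as records:
        for line in records:
            if line.startswith("WARC-Target-URI:"):
                print(line.strip())
```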
Laughably, Common Crawl shows that the author's robots.txt was configured to allow all, the entire time.
https://pastebin.com/VSHMTThJ