Comment by andrethegiant
19 days ago
Common Crawl is supposed to help with this, i.e. crawl once and host the dataset for any interested party to download out of band. However, the data can be up to a month stale, and it costs $$ to move it out of us-east-1.
I’m working on a centralized crawling platform[1] that aims to mitigate OP’s problem. A caching layer with a ~24h TTL for unauthed content would shield websites from redundant bot traffic while still providing up-to-date content to AI crawlers.
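A minimal sketch of that idea, assuming a shared cache keyed by URL; the names and structure here are illustrative, not the platform's actual API:

```python
import time

TTL_SECONDS = 24 * 60 * 60                   # ~24h freshness window for unauthed content
_cache: dict[str, tuple[float, bytes]] = {}  # url -> (fetched_at, body)

def fetch_for_crawler(url: str, origin_fetch) -> bytes:
    """Serve a cached copy if it's fresh; otherwise hit the origin once and cache it."""
    now = time.time()
    entry = _cache.get(url)
    if entry and now - entry[0] < TTL_SECONDS:
        return entry[1]                      # shields the origin from repeat bot traffic
    body = origin_fetch(url)                 # single upstream request shared by all crawlers
    _cache[url] = (now, body)
    return body
```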
You can download Common Crawl data for free over HTTPS with no credentials. If you don't store it (streamed processing or equivalent) and you aren't charged for incoming data (most clouds don't charge for ingress), you're good!
You can do so by prefixing each WARC/WAT/WET path with `https://data.commoncrawl.org/` instead of `s3://commoncrawl/`.
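A minimal sketch of streaming one file that way in Python; the path below is a placeholder (real paths come from the crawl's warc/wat/wet paths listings):

```python
import gzip
import requests

BASE = "https://data.commoncrawl.org/"                # instead of s3://commoncrawl/
path = "crawl-data/CC-MAIN-.../wet/....warc.wet.gz"   # placeholder; take it from wet.paths.gz

with requests.get(BASE + path, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # Decompress on the fly so nothing has to be written to disk.
    with gzip.open(resp.raw, "rt", encoding="utf-8", errors="replace") as records:
        for line in records:
            if line.startswith("WARC-Target-URI:"):
                print(line.strip())
```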
Laughably, Common Crawl shows that the author's robots.txt was configured to allow all, the entire time.
https://pastebin.com/VSHMTThJ