Comment by Smerity
18 days ago
You can download Common Crawl data for free using HTTPS with no credentials. If you don't store it (streamed processing or equivalent) and you have no cost for incoming data (which most clouds don't) you're good!
You can do so by adding `https://data.commoncrawl.org/` instead of `s3://commoncrawl/` before each of the WARC/WAT/WET paths.
No comments yet
Contribute on Hacker News ↗