Comment by andrethegiant
11 hours ago
I’m working on a centralized platform[1] that helps web crawlers be polite by default: respecting robots.txt, backing off on 429s, etc., and sharing a platform-wide TTL cache just for crawlers. The goal is to reduce global bot traffic by giving crawler authors a convenient option that makes their bots play nice with the open web.
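Concretely, “polite by default” means something like the sketch below (Python; the user-agent string, retry limit, and backoff numbers are illustrative assumptions, not the platform’s actual code): check robots.txt before fetching, and honor 429 / Retry-After instead of retrying immediately.

    import time
    import requests
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "PoliteCrawler/0.1"  # placeholder UA, not the platform's

    def allowed_by_robots(url: str) -> bool:
        # Fetch and parse the site's robots.txt, then ask if this UA may crawl the URL
        parts = urlsplit(url)
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def polite_get(url: str, max_retries: int = 3):
        # Skip disallowed paths entirely
        if not allowed_by_robots(url):
            return None
        for attempt in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            if resp.status_code != 429:
                return resp
            # Rate limited: honor Retry-After when it's numeric, else back off exponentially
            retry_after = resp.headers.get("Retry-After", "")
            time.sleep(int(retry_after) if retry_after.isdigit() else 2 ** attempt)
        return None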
Are you sure you're not just encouraging more people to run them?
Even so, traffic funnels through the same cache, so website owners would see the same number of hits whether there is 1 crawler or 1,000 crawlers on the platform.
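A toy model of that dedup claim (Python; the in-memory dict and 1-hour TTL are stand-ins for whatever shared storage the real platform uses): however many crawlers request the same URL inside the TTL window, the origin is fetched once.

    import time

    class SharedTTLCache:
        # One cache shared by every crawler on the platform (sketch)
        def __init__(self, ttl_seconds: float = 3600):
            self.ttl = ttl_seconds
            self._store = {}  # url -> (fetched_at, body)

        def get(self, url, fetch):
            now = time.monotonic()
            entry = self._store.get(url)
            if entry and now - entry[0] < self.ttl:
                return entry[1]      # cache hit: the origin sees nothing
            body = fetch(url)        # cache miss: exactly one request to the origin
            self._store[url] = (now, body)
            return body

    origin_hits = 0
    def fetch_from_origin(url):
        global origin_hits
        origin_hits += 1
        return b"<html>...</html>"

    cache = SharedTTLCache()
    for _ in range(1000):            # 1000 crawlers requesting the same page
        cache.get("https://example.com/", fetch_from_origin)
    print(origin_hits)               # -> 1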
How does it compare to Common Crawl?
Common Crawl’s data is up to a month old. Here you can write a custom crawler and get a REST/RAG API, usable from your apps and agents, for the data your crawler finds. All the while, it builds up a platform-wide cache, so duplicate/redundant requests never reach their origin servers.
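From an app’s point of view, querying that kind of API would look roughly like this (the base URL, endpoint path, parameters, and response shape are entirely hypothetical, shown only to illustrate the idea, not the platform’s real API):

    import requests

    # Hypothetical endpoint and auth scheme, made up for illustration only
    API_BASE = "https://api.example-crawl-platform.dev"

    def search_crawl(crawler_id: str, query: str, api_key: str):
        # Ask the (assumed) REST/RAG endpoint for pages this crawler has indexed
        resp = requests.get(
            f"{API_BASE}/v1/crawlers/{crawler_id}/search",
            params={"q": query, "limit": 10},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("results", [])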