Comment by andrethegiant

11 hours ago

I’m working on a centralized platform[1] that helps web crawlers be polite by default: it respects robots.txt, backs off on 429s, and shares a platform-wide TTL cache just for crawlers. The goal is to reduce global bot traffic by giving crawler authors a convenient option that makes their bots play nice with the open web.

[1] https://crawlspace.dev
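
A minimal sketch of what “polite by default” could look like in practice, assuming an in-memory Map as a stand-in for the platform-wide TTL cache. The TTL value, user agent string, and helper names are illustrative assumptions, not Crawlspace’s actual implementation:

```ts
type CacheEntry = { body: string; expires: number };
const cache = new Map<string, CacheEntry>(); // stand-in for the shared platform cache
const TTL_MS = 15 * 60 * 1000; // hypothetical 15-minute TTL

async function isAllowedByRobots(url: URL): Promise<boolean> {
  // Naive robots.txt check; a real parser handles User-agent groups, wildcards, etc.
  const res = await fetch(new URL("/robots.txt", url.origin));
  if (!res.ok) return true; // no robots.txt → assume allowed
  const rules = await res.text();
  const disallowed = rules
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((path) => path.length > 0);
  return !disallowed.some((path) => url.pathname.startsWith(path));
}

async function politeFetch(rawUrl: string, userAgent = "example-crawler"): Promise<string | null> {
  const url = new URL(rawUrl);

  // 1. Shared TTL cache first, so duplicate requests never reach the origin.
  const hit = cache.get(url.href);
  if (hit && hit.expires > Date.now()) return hit.body;

  // 2. robots.txt check before touching the page itself.
  if (!(await isAllowedByRobots(url))) return null;

  // 3. Fetch, honoring 429 + Retry-After (assumes delta-seconds) with a single retry.
  let res = await fetch(url, { headers: { "User-Agent": userAgent } });
  if (res.status === 429) {
    const waitSec = Number(res.headers.get("Retry-After") ?? "60");
    await new Promise((resolve) => setTimeout(resolve, waitSec * 1000));
    res = await fetch(url, { headers: { "User-Agent": userAgent } });
  }
  if (!res.ok) return null;

  const body = await res.text();
  cache.set(url.href, { body, expires: Date.now() + TTL_MS });
  return body;
}
```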

Are you sure you're not just encouraging more people to run them?

  • Even so, traffic funnels through the same cache, so website owners would see the same number of hits whether there is 1 crawler or 1,000 crawlers on the platform; a rough sketch of the idea is below.
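
One way to picture why origin traffic stays flat as the crawler count grows: requests for the same URL are coalesced onto a single shared in-flight fetch, so only the first one reaches the origin. The function names here are hypothetical, not platform APIs:

```ts
const inFlight = new Map<string, Promise<string>>();

async function fetchFromOrigin(url: string): Promise<string> {
  // This is the single request the origin server actually sees.
  const res = await fetch(url);
  return res.text();
}

function sharedFetch(url: string): Promise<string> {
  // Whether 1 crawler or 1,000 crawlers ask for this URL while a request is
  // in flight, they all share the same promise, so the origin gets one hit.
  let pending = inFlight.get(url);
  if (!pending) {
    pending = fetchFromOrigin(url).finally(() => inFlight.delete(url));
    inFlight.set(url, pending);
  }
  return pending;
}

// e.g. Promise.all(Array.from({ length: 1000 }, () => sharedFetch("https://example.com/")))
// resolves with 1,000 identical bodies but produces a single origin request.
```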

How does it compare to Common Crawl?

  • Common Crawl’s data is up to a month old. Here you write a custom crawler and get a REST/RAG API over the data it finds, which you can use from your apps and agents. All the while, it builds up a platform-wide cache, so duplicate/redundant requests never reach their origin servers.
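
Purely as an illustration of the “query your crawler’s data over REST” idea: the endpoint, parameters, and response shape below are invented for the sketch and are not Crawlspace’s documented API.

```ts
// Hypothetical client call; endpoint and fields are assumptions for illustration only.
interface PageResult {
  url: string;
  title: string;
  snippet: string;
}

async function queryMyCrawler(query: string, apiKey: string): Promise<PageResult[]> {
  // An app or agent hits a REST endpoint exposed for the custom crawler
  // and gets back results straight from that crawler's index.
  const res = await fetch(
    `https://example-crawler-api.invalid/search?q=${encodeURIComponent(query)}`,
    { headers: { Authorization: `Bearer ${apiKey}` } }
  );
  if (!res.ok) throw new Error(`query failed: ${res.status}`);
  return res.json() as Promise<PageResult[]>;
}
```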