Comment by andrethegiant

3 months ago

I’m working on a centralized platform[1] that helps web crawlers be polite by default: it respects robots.txt, backs off on 429s, and shares a platform-wide TTL cache just for crawlers. The goal is to reduce global bot traffic by giving crawler authors a convenient option that makes their bots play nice with the open web.

[1] https://crawlspace.dev
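
Roughly, the politeness layer boils down to checks like this (a sketch of the idea, not the platform’s actual implementation; the UA string, helper names, and retry numbers are made up for illustration):

    # Sketch of a polite fetch: skip disallowed paths, back off on 429s.
    # Illustrative only -- not Crawlspace's real code.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    import requests

    USER_AGENT = "Crawlspace-example/0.1"  # placeholder; real UAs are prefixed with "Crawlspace"

    def allowed_by_robots(url: str) -> bool:
        """Check the origin's robots.txt before fetching."""
        parts = urlparse(url)
        rp = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def polite_get(url: str, max_retries: int = 3):
        """Fetch url, honoring robots.txt and backing off on 429 responses."""
        if not allowed_by_robots(url):
            return None  # robots.txt says no, so never fetch
        for attempt in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            if resp.status_code != 429:
                return resp
            retry_after = resp.headers.get("Retry-After", "")
            delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
            time.sleep(delay)  # honor Retry-After if given, else exponential backoff
        return None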

> respecting robots.txt

What agent name should we put in robots.txt to deny your crawler without using a wildcard? I can't see that documented anywhere.

  • Thanks for the feedback; it’s mentioned in the platform FAQ, but I should make it more prominent in the docs. The UA will always be prefixed with the string `Crawlspace`. May I ask why you’d want to block it, even if it crawls respectfully?

    • The bot having "Crawlspace" in its UA doesn't necessarily mean it honors "Crawlspace" directives in robots.txt. Would it bail out if it saw this robots.txt?

        User-agent: Crawlspace
        Disallow: /
      
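      A quick way to test that locally, assuming the crawler matches agent tokens the way Python’s stock urllib.robotparser does (the UA string is just an example of one prefixed with "Crawlspace"):

        # Parse the robots.txt above and ask whether a "Crawlspace"-prefixed
        # UA may fetch anything on the site.
        import urllib.robotparser

        rp = urllib.robotparser.RobotFileParser()
        rp.parse(["User-agent: Crawlspace", "Disallow: /"])

        ua = "Crawlspace-example/0.1"
        print(rp.can_fetch(ua, "https://example.com/anything"))  # prints False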

      > May I ask why you’d want to block it, even if it crawls respectfully?

      The main audience for the product seems to be AI companies, and some people just aren't interested in feeding that beast. Lots of sites block Common Crawl even though their bot is usually polite.

      https://originality.ai/ai-bot-blocking

How does it compare to Common Crawl?

  • Common Crawl’s data is up to a month old. Here you can write a custom crawler and get a REST/RAG API over the data it finds, for use in your apps and agents. All the while, it builds up a platform-wide cache, so duplicate/redundant requests don’t reach their origin servers.
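
    Conceptually, the shared cache looks something like this (a sketch with placeholder names and TTL, not the actual API):

      # Platform-wide TTL cache sketch: a second crawler asking for the same
      # URL within the TTL is served from cache and never touches the origin.
      import time

      CACHE_TTL = 3600  # seconds; placeholder value
      _cache: dict[str, tuple[float, bytes]] = {}  # url -> (fetched_at, body)

      def cached_fetch(url: str, fetch) -> bytes:
          entry = _cache.get(url)
          if entry and time.time() - entry[0] < CACHE_TTL:
              return entry[1]  # cache hit: the origin is not contacted
          body = fetch(url)    # cache miss: exactly one request reaches the origin
          _cache[url] = (time.time(), body)
          return body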