Comment by andrethegiant
11 hours ago
I’m working on a centralized platform[1] that helps web crawlers be polite by default: respecting robots.txt, backing off on 429s, etc., and sharing a platform-wide TTL cache just for crawlers. The goal is to reduce global bot traffic by giving crawler authors a convenient option that makes their bots play nice with the open web.
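Concretely, “polite by default” means something like the sketch below (Python; the user-agent string, retry limit, and backoff numbers are illustrative assumptions, not the platform’s actual code): check robots.txt before fetching, and honor 429 / Retry-After instead of retrying immediately.

    import time
    import requests
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "PoliteCrawler/0.1"  # placeholder UA, not the platform's

    def allowed_by_robots(url: str) -> bool:
        # Fetch and parse the site's robots.txt, then ask if this UA may crawl the URL
        parts = urlsplit(url)
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def polite_get(url: str, max_retries: int = 3):
        # Skip disallowed paths entirely
        if not allowed_by_robots(url):
            return None
        for attempt in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            if resp.status_code != 429:
                return resp
            # Rate limited: honor Retry-After when it's numeric, else back off exponentially
            retry_after = resp.headers.get("Retry-After", "")
            time.sleep(int(retry_after) if retry_after.isdigit() else 2 ** attempt)
        return None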
Are you sure you're not just encouraging more people to run them?
Even so, traffic funnels through the same cache, so website owners would see the same number of hits whether there is 1 crawler or 1,000 crawlers on the platform.
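A toy model of that dedup claim (Python; the in-memory dict and 1-hour TTL are stand-ins for whatever shared storage the real platform uses): however many crawlers request the same URL inside the TTL window, the origin is fetched once.

    import time

    class SharedTTLCache:
        # One cache shared by every crawler on the platform (sketch)
        def __init__(self, ttl_seconds: float = 3600):
            self.ttl = ttl_seconds
            self._store = {}  # url -> (fetched_at, body)

        def get(self, url, fetch):
            now = time.monotonic()
            entry = self._store.get(url)
            if entry and now - entry[0] < self.ttl:
                return entry[1]      # cache hit: the origin sees nothing
            body = fetch(url)        # cache miss: exactly one request to the origin
            self._store[url] = (now, body)
            return body

    origin_hits = 0
    def fetch_from_origin(url):
        global origin_hits
        origin_hits += 1
        return b"<html>...</html>"

    cache = SharedTTLCache()
    for _ in range(1000):            # 1000 crawlers requesting the same page
        cache.get("https://example.com/", fetch_from_origin)
    print(origin_hits)               # -> 1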
How does it compare to Common Crawl?
Common Crawl’s data is up to a month old. Here you can write a custom crawler and get a REST/RAG API, usable from your apps and agents, for the data your crawler finds. All the while, it builds up a platform-wide cache, so duplicate/redundant requests never reach their origin servers.
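From an app’s point of view, querying that kind of API would look roughly like this (the base URL, endpoint path, parameters, and response shape are entirely hypothetical, shown only to illustrate the idea, not the platform’s real API):

    import requests

    # Hypothetical endpoint and auth scheme, made up for illustration only
    API_BASE = "https://api.example-crawl-platform.dev"

    def search_crawl(crawler_id: str, query: str, api_key: str):
        # Ask the (assumed) REST/RAG endpoint for pages this crawler has indexed
        resp = requests.get(
            f"{API_BASE}/v1/crawlers/{crawler_id}/search",
            params={"q": query, "limit": 10},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("results", [])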