Comment by andrethegiant

3 months ago

I’m working on a centralized platform[1] that helps web crawlers be polite by default: it respects robots.txt, backs off on 429s, and shares a platform-wide TTL cache just for crawlers. The goal is to reduce global bot traffic by giving crawler authors a convenient option that makes their bots play nice with the open web.

[1] https://crawlspace.dev
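
Roughly, the politeness layer boils down to checks like this (a sketch of the idea, not the platform’s actual implementation; the UA string, helper names, and retry numbers are made up for illustration):

    # Sketch of a polite fetch: skip disallowed paths, back off on 429s.
    # Illustrative only -- not Crawlspace's real code.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    import requests

    USER_AGENT = "Crawlspace-example/0.1"  # placeholder; real UAs are prefixed with "Crawlspace"

    def allowed_by_robots(url: str) -> bool:
        """Check the origin's robots.txt before fetching."""
        parts = urlparse(url)
        rp = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def polite_get(url: str, max_retries: int = 3):
        """Fetch url, honoring robots.txt and backing off on 429 responses."""
        if not allowed_by_robots(url):
            return None  # robots.txt says no, so never fetch
        for attempt in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            if resp.status_code != 429:
                return resp
            retry_after = resp.headers.get("Retry-After", "")
            delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
            time.sleep(delay)  # honor Retry-After if given, else exponential backoff
        return None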

> respecting robots.txt

What agent name should we put in robots.txt to deny your crawler without using a wildcard? I can't see that documented anywhere.

  • Thanks for the feedback; it’s mentioned in the platform FAQ, but I should make it more prominent in the docs. The UA will always be prefixed with the string `Crawlspace`. May I ask why you’d want to block it, even if it crawls respectfully?

    • The bot having "Crawlspace" in its UA doesn't necessarily mean it honors "Crawlspace" directives in robots.txt. Would it bail out if it saw this robots.txt?

        User-agent: Crawlspace
        Disallow: /
      
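      A quick way to test that locally, assuming the crawler matches agent tokens the way Python’s stock urllib.robotparser does (the UA string is just an example of one prefixed with "Crawlspace"):

        # Parse the robots.txt above and ask whether a "Crawlspace"-prefixed
        # UA may fetch anything on the site.
        import urllib.robotparser

        rp = urllib.robotparser.RobotFileParser()
        rp.parse(["User-agent: Crawlspace", "Disallow: /"])

        ua = "Crawlspace-example/0.1"
        print(rp.can_fetch(ua, "https://example.com/anything"))  # prints False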

      > May I ask why you’d want to block it, even if it crawls respectfully?

      The main audience for the product seems to be AI companies, and some people just aren't interested in feeding that beast. Lots of sites block Common Crawl even though their bot is usually polite.

      https://originality.ai/ai-bot-blocking

How does it compare to Common Crawl?

  • Common Crawl’s data is up to a month old. Here you can write a custom crawler and get a REST/RAG API over the data it finds, for use in your apps and agents. All the while, it builds up a platform-wide cache, so duplicate/redundant requests don’t reach their origin servers.
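
    Conceptually, the shared cache looks something like this (a sketch with placeholder names and TTL, not the actual API):

      # Platform-wide TTL cache sketch: a second crawler asking for the same
      # URL within the TTL is served from cache and never touches the origin.
      import time

      CACHE_TTL = 3600  # seconds; placeholder value
      _cache: dict[str, tuple[float, bytes]] = {}  # url -> (fetched_at, body)

      def cached_fetch(url: str, fetch) -> bytes:
          entry = _cache.get(url)
          if entry and time.time() - entry[0] < CACHE_TTL:
              return entry[1]  # cache hit: the origin is not contacted
          body = fetch(url)    # cache miss: exactly one request reaches the origin
          _cache[url] = (time.time(), body)
          return body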