
Comment by progmetaldev

1 year ago

Whoever configures the Cloudflare rules should be turning off the firewall for things like robots.txt and sitemap.xml. You can still use caching for those resources to prevent them from becoming a front door for DDoS.
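For anyone who hasn't set this up: in Cloudflare's rules language (I'm writing the expression from memory, so treat the exact syntax as an approximation), both a WAF custom rule with the Skip action and a cache rule can match these paths with an expression along the lines of:

```
(http.request.uri.path eq "/robots.txt") or (http.request.uri.path eq "/sitemap.xml")
```

The cache rule would then mark the response as eligible for caching with a long edge TTL, so repeated bot fetches get answered at the edge instead of hitting the origin.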

It seems like common cases like this should be handled correctly by default. These are cacheable requests intended for robots. Sure, it would be nice if webmasters configured it themselves, but I suspect only a tiny minority do.

For example, even Cloudflare hasn't configured their official blog's RSS feed properly. My feed reader (running in a DigitalOcean datacenter) hasn't been able to access it since 2021: a 403 every time, even though I've backed off to checking weekly. This is a cacheable endpoint with public data intended for robots. If they can't configure their own product correctly for their official blog, how can they expect other sites to?

  • I agree, but I also somewhat understand. Some people will actually pay more per month for Cloudflare than for their own hosting; the Cloudflare Pro plan is $20/month USD. Some sites wouldn't be able to handle the constant requests for robots.txt on their own, because bots don't necessarily respect cache headers (if cache headers are even configured for robots.txt), and the sheer number of bots that look at robots.txt and will ignore a caching header is just too high.

    If you are writing some kind of malicious crawler that doesn't care about rate-limiting and wants to scan as many sites as possible to build a list of the most vulnerable ones to hack, you will fetch robots.txt, because that is the file that tells robots NOT to index these pages. I never use robots.txt for any kind of security through obscurity. I've only ever bothered with it to make SEO easier: to control a virtual subdirectory of a site, to block things like repeated content with alternative layouts (to avoid duplicate-content issues), or to get discontinued sections of a site to drop out of SERPs.
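    To be concrete, the robots.txt I'd write for that is purely an SEO hint, never an access control. Something like this (the paths are made up for illustration):

    ```
    User-agent: *
    # Keep the print-friendly duplicates of articles out of crawling
    # so they don't compete with the canonical pages.
    Disallow: /print/
    # Let a discontinued section fall out of crawling over time.
    Disallow: /discontinued-catalog/
    ```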

    • > sheer number of bots that look at robots.txt and will ignore a caching header

      This is not relevant, because Cloudflare will cache it so it never hits your origin, unless the bots are adding random URL parameters (which you can teach Cloudflare to ignore, though I don't think that should be a default configuration).
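      To make "cache it at the edge and ignore query-string noise" concrete, here's a rough sketch as a Cloudflare Worker using the documented Cache API. The paths, the one-day TTL, and routing the Worker onto these URLs are my assumptions; a dashboard cache rule does the same job without any code:

      ```typescript
      // Sketch: answer robots.txt/sitemap.xml from Cloudflare's edge cache so
      // repeated bot traffic never reaches the origin. Types come from
      // @cloudflare/workers-types; paths and TTL are illustrative assumptions.
      export default {
        async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
          const url = new URL(request.url);
          if (url.pathname !== "/robots.txt" && url.pathname !== "/sitemap.xml") {
            return fetch(request); // everything else passes straight through
          }

          // Normalize the cache key so random query strings don't cause misses.
          const cacheKey = new Request(url.origin + url.pathname, { method: "GET" });
          const cache = caches.default;

          const hit = await cache.match(cacheKey);
          if (hit) return hit; // served from the edge; the origin never sees it

          // Miss: fetch once from the origin, then keep a copy at the edge.
          const originResponse = await fetch(cacheKey);
          const response = new Response(originResponse.body, originResponse);
          response.headers.set("Cache-Control", "public, max-age=86400");
          ctx.waitUntil(cache.put(cacheKey, response.clone()));
          return response;
        },
      };
      ```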
