Comment by buro9

10 months ago

Nginx, it's nothing special it's just my load balancer.

if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {return 403;}

403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests a 500, 503, or 429 HTTP status code.

  • > 403 is generally a bad way to get crawlers to go away

    Hardly... the article links says that a 403 will cause Google to stop crawling and remove content... that's the desired outcome.

    I'm not trying to rate limit, I'm telling them to go away.

  • That article describes the exact behaviour you want from the AI crawlers. If you let them know they’re rate limited they’ll just change IP or user agent.

From the article:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.

  • Switching to sending wrong, inexpensive data might be preferable to blocking them.

    I've used this with voip scanners.

    • Oh I did this with the Facebook one and redirected them to a 100MB file of garbage that is part of the Cloudflare speed test... they hit this so many times that it would've been 2PB sent in a matter of hours.

      I contacted the network team at Cloudflare to apologise and also to confirm whether Facebook did actually follow the redirect... it's hard for Cloudflare to see 2PB, that kind of number is too small on a global scale when it's occurred over a few hours, but given that it was only a single PoP that would've handled it, then it would've been visible.

      It was not visible, which means we can conclude that Facebook were not following redirects, or if they were, they were just queuing it for later and would only hit it once and not multiple times.

      1 reply →