
Comment by pogue

20 days ago

What do you use to block them?

Nginx; it's nothing special, it's just my load balancer.

    if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) { return 403; }
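
For anything beyond a handful of patterns, a map is the usual nginx idiom for keeping this readable. A minimal sketch; the bot names below are illustrative placeholders, not the commenter's actual list:

    # Goes inside the http{} block; evaluated per request.
    # Bot names are illustrative placeholders, not a real blocklist.
    map $http_user_agent $blocked_ua {
        default        0;
        ~*gptbot       1;   # ~* = case-insensitive regex match
        ~*ccbot        1;
        ~*bytespider   1;
    }

    server {
        listen 80;
        server_name example.com;   # placeholder

        if ($blocked_ua) {
            return 403;   # hard refusal, same as the one-liner above
        }
    }

One pattern per line also makes the blocklist easy to extend without growing an unreadable alternation.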

  • 403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests returning a 500, 503, or 429 HTTP status code instead (see the sketch after this thread).

    • > 403 is generally a bad way to get crawlers to go away

      Hardly... the linked article says that a 403 will cause Google to stop crawling and remove the content... that's the desired outcome.

      I'm not trying to rate-limit them; I'm telling them to go away.

    • That article describes the exact behaviour you want from the AI crawlers. If you let them know they’re rate limited, they’ll just change IPs or user agents.

  • From the article:

    > If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

    It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.
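
For anyone who does want the softer behaviour the Google post recommends (pause crawling without being deindexed), a minimal nginx sketch might return 429 with a Retry-After hint instead of 403. The UA pattern and the 3600-second value here are placeholder assumptions:

    # Goes inside the http{} block; flags UAs to throttle rather than ban.
    map $http_user_agent $throttled_ua {
        default      0;
        ~*somebot    1;   # placeholder pattern
    }

    server {
        listen 80;
        server_name example.com;   # placeholder

        location / {
            if ($throttled_ua) {
                # "always" makes nginx attach the header to the 429 response too
                add_header Retry-After 3600 always;
                return 429;
            }
        }
    }

As noted upthread, though, a crawler that learns it is rate-limited may simply rotate IPs or UA strings, which is the argument for the flat 403.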