
Comment by pogue

20 days ago

What do you use to block them?

Nginx; it's nothing special, it's just my load balancer.

    if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) { return 403; }
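
For anything beyond a handful of patterns, a map is the usual nginx idiom for keeping this readable. A minimal sketch; the bot names below are illustrative placeholders, not the commenter's actual list:

    # Goes inside the http{} block; evaluated per request.
    # Bot names are illustrative placeholders, not a real blocklist.
    map $http_user_agent $blocked_ua {
        default        0;
        ~*gptbot       1;   # ~* = case-insensitive regex match
        ~*ccbot        1;
        ~*bytespider   1;
    }

    server {
        listen 80;
        server_name example.com;   # placeholder

        if ($blocked_ua) {
            return 403;   # hard refusal, same as the one-liner above
        }
    }

One pattern per line also makes the blocklist easy to extend without growing an unreadable alternation.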

  • 403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests returning a 500, 503, or 429 HTTP status code instead (see the sketch after this thread).

    • > 403 is generally a bad way to get crawlers to go away

      Hardly... the linked article says that a 403 will cause Google to stop crawling and remove the content... that's the desired outcome.

      I'm not trying to rate-limit them; I'm telling them to go away.

    • That article describes the exact behaviour you want from the AI crawlers. If you let them know they’re rate limited, they’ll just change IPs or user agents.

  • From the article:

    > If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

    It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.
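
For anyone who does want the softer behaviour the Google post recommends (pause crawling without being deindexed), a minimal nginx sketch might return 429 with a Retry-After hint instead of 403. The UA pattern and the 3600-second value here are placeholder assumptions:

    # Goes inside the http{} block; flags UAs to throttle rather than ban.
    map $http_user_agent $throttled_ua {
        default      0;
        ~*somebot    1;   # placeholder pattern
    }

    server {
        listen 80;
        server_name example.com;   # placeholder

        location / {
            if ($throttled_ua) {
                # "always" makes nginx attach the header to the 429 response too
                add_header Retry-After 3600 always;
                return 429;
            }
        }
    }

As noted upthread, though, a crawler that learns it is rate-limited may simply rotate IPs or UA strings, which is the argument for the flat 403.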