Comment by grishka

7 hours ago

Thank you very much for the observation about headers. I just looked closer at the bot traffic I'm currently receiving on my small fediverse server and noticed that these requests not only use user agents of old Chrome versions, but also never set the Accept-Language header, which is indeed something that no real Chromium browser would do. So I added a rule to my nginx config to return a 403 to these requests. The number of these requests per second seems to have started declining.

The important thing is to be aware of your adversary. If it’s a big network which doesn’t care about you specifically, block away. But if it’s a motivated group interested in your site specifically, then you have to be very careful. The extreme example of the latter is yt-dlp, which continues to work despite YouTube’s best efforts.

For those adversaries, you need to work out a careful balance between deterrence, solving problems (e.g. resource abuse), and your desire to “win”. In extreme cases your best strategy is for your filter to “work” but be broken in hard-to-detect ways: for example, showing all but the most valuable content, spiking the data with just enough rubbish to diminish its value, or having the content indexes return delayed/stale/incomplete data.

And whatever you do, don’t use incrementing integers. Ask me how I know.
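The comment doesn't say why, but one common reason is that sequential IDs let a scraper enumerate every record just by counting up. A minimal sketch of an alternative, assuming that is the concern; `new_public_id` is a made-up helper name, not something from the thread:

```python
import secrets


def new_public_id(num_bytes: int = 16) -> str:
    """Return a random, URL-safe identifier that cannot be guessed by counting."""
    # token_urlsafe draws from the OS CSPRNG and base64url-encodes the bytes.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Each call yields an unrelated value, so knowing one ID reveals no others.
    print(new_public_id())
```

The internal primary key can stay an integer; only the identifier exposed in URLs needs to be unguessable.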

  • In my particular case, I don't mind the crawling. It's a fediverse server. There is nothing secret there. All content is available via ActivityPub anyway for anyone to grab. However, these bots specifically violated both robots.txt and rel="nofollow" while hitting endpoints like "log in to like this post" pages tens of times per second. They were just wasting my server's resources for nothing.

It's been a few hours. These particular bots have completely stopped. There are still some bot-looking requests in the log, with a newer-version Chrome UA on both Mac and Windows, but there aren't nearly as many of them.

Config snippet for anyone interested:

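    # Initialize the flag so nginx doesn't warn about an uninitialized variable.
    set $block "";
    # Flag Chrome user agents whose build and patch numbers have 2+ digits.
    if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
      set $block 1;
    }
    # Append a second flag when Accept-Language is missing or empty.
    if ($http_accept_language = "") {
      set $block "${block}1";
    }
    # Return 403 only when both conditions matched.
    if ($block = "11") {
      return 403;
    }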
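
For anyone who wants to check what the rule would catch before deploying it, here is a rough Python sketch of the same logic. It reuses the regex from the config above; the function name `should_block` and the sample user agent strings are only illustrative, and Python's `re` is not guaranteed to behave identically to nginx's PCRE in every edge case.

```python
import re

# Same pattern as the nginx rule; nginx's ~* operator is case-insensitive.
OLD_CHROME_UA = re.compile(r"Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}", re.IGNORECASE)


def should_block(headers: dict) -> bool:
    """Return True when a request would get a 403 under the rule above."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    accept_language = normalized.get("accept-language", "")
    # Block only when both conditions hold: Chrome-style UA with a full
    # version string, and a missing or empty Accept-Language header.
    return bool(OLD_CHROME_UA.search(user_agent)) and accept_language == ""


if __name__ == "__main__":
    # Illustrative sample values, not taken from the actual server logs.
    old_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36")
    print(should_block({"User-Agent": old_ua}))
    print(should_block({"User-Agent": old_ua, "Accept-Language": "en-US"}))
```

Note that a user agent in the shortened `Chrome/120.0.0.0` form does not match the pattern, because the last two components have only one digit each. That would be consistent with the newer-version Chrome UAs still showing up in the log.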

That's a simple and effective way to block a lot of bots, gonna implement that on my sites. Thanks!