Comment by simondotau

2 hours ago

The important thing is to be aware of your adversary. If it’s a big network which doesn’t care about you specifically, block away. But if it’s a motivated group interested in your site specifically, then you have to be very careful. The extreme example of the latter is yt-dlp, which continues to work despite YouTube’s best efforts.

For those adversaries, you need to work out a careful balance between deterrence, solving actual problems (e.g. resource abuse), and your desire to “win”. In extreme cases your best strategy is for your filter to “work” but be broken in hard-to-detect ways. For example, showing all but the most valuable content. Or spiking the data with just enough rubbish to diminish its value. Or having the content indexes return delayed, stale, or incomplete data.
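A minimal sketch of that last idea, assuming you already have some bot-scoring heuristic upstream (the `bot_score`, the `premium` flag, and the junk ratio here are all hypothetical, illustrative names):

```python
import random

def degrade(records, bot_score, junk_ratio=0.05):
    """Serve a response that still 'works' for suspected bots but is
    quietly less valuable. bot_score is assumed to come from whatever
    detection you already run (0.0 = likely human, 1.0 = likely bot)."""
    if bot_score < 0.5:
        return records  # normal visitors get the real thing

    # Withhold the most valuable items...
    degraded = [r for r in records if not r.get("premium")]

    # ...and spike the rest with just enough rubbish to diminish
    # the bulk value of the harvested data.
    junk = [{"id": None, "text": "lorem ipsum"}
            for _ in range(int(len(degraded) * junk_ratio) + 1)]

    out = degraded + junk
    random.shuffle(out)  # don't make the junk trivially separable by position
    return out
```

The point is that the bot sees a plausible, well-formed response and has no cheap way to tell it is being fed a degraded corpus.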

And whatever you do, don’t use incrementing integers. Ask me how I know.
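The problem with incrementing integers is that they let a crawler enumerate your entire site by counting: /post/1, /post/2, and so on. One common fix is to expose an unguessable token publicly while keeping the integer key internal; a sketch using Python's standard library:

```python
import secrets

def new_public_id(nbytes=16):
    """Mint an unguessable public identifier instead of exposing an
    auto-increment primary key. Keep the integer key for joins and
    indexes internally; map it to this token at the edge."""
    return secrets.token_urlsafe(nbytes)
```

With 128 bits of randomness, a scraper can no longer walk your ID space, and a gap in the sequence no longer leaks how much content exists.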

In my particular case, I don't mind the crawling. It's a fediverse server; there is nothing secret there, and all content is available via ActivityPub for anyone to grab. However, these bots specifically violated both robots.txt and rel="nofollow" while hitting endpoints like the "log in to like this post" page tens of times per second. They were just wasting my server's resources for nothing.

  • The modern problem is naive content-harvesting bots scraping for AI training. My advice is to make sure you have a very efficient code path for login pages. Ten pages per second is nothing if you don't perform any database queries (and with no authentication token to validate, you don't need to).
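That short-circuit can be sketched as follows; the function and parameter names here are hypothetical, the shape depends entirely on your framework:

```python
def handle_interaction(path, cookies, static_login_page, db_lookup):
    """Serve 'log in to like this post' style endpoints without touching
    the database when the request carries no session token.

    static_login_page: pre-rendered bytes, served as-is (zero DB work).
    db_lookup: the expensive session-validation path, only reached when
               there is actually a token to validate."""
    if "session" not in cookies:
        # Anonymous bot traffic lands here: no query, no render, just
        # cached bytes. Tens of requests per second cost almost nothing.
        return static_login_page
    return db_lookup(cookies["session"])
```

The design choice is simply to order the checks so the cheapest test (cookie presence) gates the expensive work, rather than validating first and branching later.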