Comment by grishka

3 months ago

In my particular case, I don't mind the crawling. It's a fediverse server. There is nothing secret there. All content is available via ActivityPub anyway for anyone to grab. However, these bots specifically violated both robots.txt and rel="nofollow" while hitting endpoints like "log in to like this post" pages tens of times per second. They were just wasting my server's resources for nothing.

2 comments

simondotau 3 months ago

My base advice is to make sure you have a very efficient code path for login pages. 10 pages per second is nothing if you don’t have to perform any database queries (because you don’t have any authentication token to validate).
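A minimal sketch of that kind of fast path, in Python for illustration only. The cookie name `session`, the `validate_session` callback, and the page markup are all hypothetical and do not come from any particular server. The idea is that a request with no token returns a pre-rendered page before any database code runs.

```python
from typing import Callable, Mapping, Tuple

# Rendered once at startup (hypothetical markup), so serving it costs no queries.
LOGIN_PAGE = (
    b"<html><body><form method='post' action='/login'>"
    b"<input name='user'><input name='password' type='password'>"
    b"</form></body></html>"
)


def handle_login_page(
    cookies: Mapping[str, str],
    validate_session: Callable[[str], bool],
) -> Tuple[int, bytes]:
    """Return (status, body) for a GET of the login page."""
    token = cookies.get("session")  # assumed cookie name
    if not token:
        # Fast path: nothing to validate, so the database is never touched.
        return 200, LOGIN_PAGE
    if validate_session(token):
        # Already logged in: the caller would redirect elsewhere.
        return 302, b""
    # Stale or forged token: fall back to the same static page.
    return 200, LOGIN_PAGE
```

An anonymous bot hitting this handler only ever takes the first branch, so its cost is one dictionary lookup and one response write.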

Beyond that, look at how the bots are finding new URLs to probe, and don't give them access to those lists/indexes. In particular, don't forget about site maps. I use Cloudflare rules to restrict my site map to known bots only.
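For servers without Cloudflare in front, a rough application-level stand-in might look like the sketch below. This is not a Cloudflare rule. The path `/sitemap.xml` and the allowlist entries are assumptions, and a User-Agent header is trivially spoofed, so this only filters out crawlers that do not bother to lie.

```python
# Hypothetical allowlist of crawler names to match inside the User-Agent header.
KNOWN_BOT_MARKERS = ("googlebot", "bingbot")


def may_fetch_sitemap(path: str, user_agent: str) -> bool:
    """Allow everything except the site map, which only listed bots may fetch."""
    if path != "/sitemap.xml":  # assumed site map location
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)
```

A stricter setup would verify the crawler by IP range or reverse DNS rather than by header alone.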

grishka 3 months ago

Of course. My server wasn't struggling with that. I haven't benchmarked that server, but on an M1 Max, the app can easily serve hundreds of requests per second for profile pages, which is the heaviest thing an unauthenticated user can access (I cache a lot in memory, but posts, photos, and friend lists aren't among the things I cache). It was just a mild annoyance.

They discovered those URLs simply by parsing pages that contain like buttons. Those do have rel="nofollow" on them, and the URL pattern is disallowed in robots.txt, but I'd be surprised if that'd stop someone who uses thousands of IPs to proxy their requests. I don't have a site map.
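For reference, this is roughly the check a well-behaved crawler would run before following such a link, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `/like/` prefix, and the `example.social` host are made up for illustration and are not the actual rules on this server. Note that this parser does plain prefix matching, so the sketch uses a prefix rather than a wildcard pattern.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows the "log in to like" endpoints.
ROBOTS_TXT = """\
User-agent: *
Disallow: /like/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("AnyBot", "https://example.social/like/123"))
print(allowed("AnyBot", "https://example.social/users/alice"))
```

The first call reports the like URL as off limits and the second reports the profile page as fetchable. Nothing enforces the result, which is why a crawler that ignores it has to be handled server-side.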