Comment by jeroenhd
1 year ago
I've had such bots on my server: some Chinese Huawei bot as well as an American one.
They ignored robots.txt (they claimed to respect it, but I blacklisted them there and they didn't stop) and started randomly generating image paths. At some point /img/123.png became /img/123.png?a=123 or whatever, and they just kept adding parameters and subpaths for no good reason. Nginx dutifully ignored the extra parameters and kept sending the same image files over and over again, wasting everyone's time and bandwidth.
I was able to block these bots by just blocking the entire IP range at the firewall level (for Huawei I had to block all of China Telecom and later a huge range owned by Tencent for similar reasons).
I have lost all faith in scrapers. I've written my own scrapers too, but almost all of the scrapers I've come across are nefarious. Some scour the internet for personal data to sell, some look for websites to throw hack attempts at in the hope of brute-forcing their way to bug bounty payouts, and others are just scraping for more AI content. Until the scraping industry starts behaving, I can't feel bad for people blocking these things, even if that hurts small search engines.
Why not just ignore the bots? I have a Linode VPS, cheapest tier, and I get 1TB of network transfer a month. The bots that you're concerned about use a tiny fraction of that (<1%). I'm not behind a CDN and I've never put effort into banning at the IP level or setting up fail2ban.
I get that there might be some feeling of righteous justice that comes from removing these entries from your Nginx logs, but it also seems like there's a lot of self-induced stress that comes from monitoring your Nginx and SSH logs for failures.
Honestly, it should just come down to rate limiting and to what you're willing to serve and to whom. As a free-information idealist, I'm OK with bots accessing my public web servers, but not OK with letting them consume all my bandwidth and compute cycles. I'm also not OK with legitimate users consuming all my resources. So I should employ strategies that prevent individual clients, or groups of clients, from endlessly submitting requests, whether those requests make sense or are "junk."
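To make that concrete, here's the kind of per-client limiting I have in mind, as a rough Python sketch: a token bucket keyed by client IP. The rate and burst numbers are placeholders, and in practice I'd let the proxy do this (nginx's limit_req module covers the per-IP case) rather than the application.

    import time
    from collections import defaultdict

    RATE = 5.0    # tokens refilled per second (placeholder)
    BURST = 20.0  # maximum burst size (placeholder)

    # client ip -> (available tokens, time of last refill)
    buckets = defaultdict(lambda: (BURST, time.monotonic()))

    def allow(client_ip: str) -> bool:
        """Refill this client's bucket, then spend one token if available."""
        tokens, last = buckets[client_ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens >= 1.0:
            buckets[client_ip] = (tokens - 1.0, now)
            return True
        buckets[client_ip] = (tokens, now)
        return False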
Rate limiting doesn't help if the requests are split across hundreds of sessions, especially if your account creation process was also bot-friendly.
Fundamentally it's adversarial, so expecting a single simple concept to properly cover even half of the problematic requests is unrealistic.
Rate limiting could help when an automated process is scanning arbitrary, generated URLs, inevitably producing a shitton of 404 errors, something your rate-limiting logic can easily check for (depending on the server/proxy software, of course). Normal users, or even normal bots, won't generate excessive 404s in a short time frame, so that's potentially a pretty simple metric by which to apply a rate limit. Just an idea though; I've not done it myself...
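Something along these lines, as a quick Python sketch. The thresholds are arbitrary and I haven't battle-tested this; it's just to show how simple the metric is.

    import time
    from collections import defaultdict, deque

    WINDOW = 60      # seconds to look back (arbitrary)
    MAX_404S = 30    # 404s tolerated per window before throttling (arbitrary)

    recent_404s = defaultdict(deque)  # client ip -> timestamps of recent 404s

    def should_throttle(client_ip: str, status: int) -> bool:
        """Record each 404 per client and flag clients exceeding the threshold."""
        now = time.monotonic()
        hits = recent_404s[client_ip]
        if status == 404:
            hits.append(now)
        # drop entries that have fallen out of the window
        while hits and now - hits[0] > WINDOW:
            hits.popleft()
        return len(hits) > MAX_404S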
Rate limiting by IP, blocking obvious datacenter ASNs, and blocking identifiable JA3 fingerprints is quite simple and surprisingly effective at stopping most scrapers, and it can all be done server-side; I wouldn't be surprised if this catches more than half of the problematic requests to the average website. But I agree that if you have a website "worth" scraping, there will probably be some individuals motivated enough to bypass those restrictions.
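To sketch what that check can look like server-side: placeholder data only, since you'd export the ASNs' announced prefixes yourself, and the JA3 hash is assumed to be handed to you by whatever terminates TLS. Neither value below is real; the point is that the lookup itself is cheap, the real work is maintaining the lists.

    import ipaddress

    # Placeholder prefixes and hash, not real blocklist data.
    DATACENTER_PREFIXES = [
        ipaddress.ip_network("203.0.113.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
    ]
    BAD_JA3_HASHES = {"0123456789abcdef0123456789abcdef"}

    def looks_like_scraper(client_ip: str, ja3_hash: str) -> bool:
        """Flag clients from listed prefixes or with a known-bad TLS fingerprint."""
        addr = ipaddress.ip_address(client_ip)
        from_datacenter = any(addr in net for net in DATACENTER_PREFIXES)
        return from_datacenter or ja3_hash in BAD_JA3_HASHES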
Sounds like a problem easily solved with fail2ban, which lets legitimate folks in and keeps offenders out, and also unbans after a set amount of time so that dynamic IPs don't screw over legitimate users permanently.
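For anyone who hasn't used it: a minimal jail.local along these lines covers the SSH side (the numbers are illustrative, not a recommendation). For the HTTP bots you'd point a jail at a filter that matches your nginx logs, either one of the bundled nginx filters or a custom regex for things like 404 floods.

    # /etc/fail2ban/jail.local -- illustrative values only
    [sshd]
    enabled = true
    # ban after 5 failures within 600 seconds, unban automatically after 3600
    maxretry = 5
    findtime = 600
    bantime = 3600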