Comment by x3haloed
1 year ago
Honestly, it should just come down to rate limiting and what you’re willing to serve and to whom. As a free information idealist, I’m OK with bots accessing public web servers, but not OK with letting them consume all my bandwidth and compute cycles. I’m also not OK with legitimate users consuming all my resources. So I should employ strategies that prevent individual clients or groups of clients from endlessly submitting requests, whether those requests make sense or are “junk.”
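To make the per-client idea concrete, here is a minimal token-bucket sketch. Everything in it is an assumption for illustration: the `RateLimiter` class, the `rate` and `capacity` parameters, and the notion that a client is identified by some string key (an IP, an API key, a session). A real deployment would more likely use the limiter built into the proxy or web server.

```python
import time


class RateLimiter:
    """Hypothetical in-memory token bucket, one bucket per client key."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested
        self.buckets = {}         # client_id -> (tokens, last_update)

    def allow(self, client_id):
        now = self.clock()
        # A client we have never seen starts with a full bucket.
        tokens, updated = self.buckets.get(client_id, (self.capacity, now))
        # Refill for the elapsed time, capped at the burst capacity.
        tokens = min(self.capacity, tokens + (now - updated) * self.rate)
        if tokens >= 1.0:
            self.buckets[client_id] = (tokens - 1.0, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False
```

The limiter does not look at what the request contains, which matches the point above: well-formed and junk requests spend tokens the same way.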
Rate limiting doesn't help if the requests are split across hundreds of sessions, especially if your account creation process is also bot-friendly.
Fundamentally it's adversarial, so expecting a single simple concept to properly cover even half of the problematic requests is unrealistic.
Rate limiting could help when an automated process is scanning arbitrary, generated URLs, inevitably generating a shitton of 404 errors, which is something your rate limiting logic can easily check for (depending on server/proxy software, of course). Normal users or even normal bots won't generate excessive 404s in a short time frame, so that's potentially a pretty simple metric by which to apply a rate limit. Just an idea though, I've not done that myself...
I did that and it works great.
Specifically, I use fail2ban to count the 404s and ban the IP temporarily when a certain threshold is exceeded in a given time frame. Every time I check fail2ban stats it has hundreds of IPs blocked.
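For anyone who wants the same idea without fail2ban, the counting logic can be sketched in-process like this. This is not how fail2ban works internally (it parses log files and manages firewall rules); the `NotFoundBanList` class and its `max_404s`, `window`, and `ban_time` parameters are made-up names, loosely analogous to a threshold, a time frame, and a ban duration.

```python
import time
from collections import defaultdict, deque


class NotFoundBanList:
    """Hypothetical sliding-window counter that bans IPs producing too many 404s."""

    def __init__(self, max_404s=20, window=60.0, ban_time=600.0,
                 clock=time.monotonic):
        self.max_404s = max_404s    # 404s tolerated within the window
        self.window = window        # window length in seconds
        self.ban_time = ban_time    # ban duration in seconds
        self.clock = clock          # injectable for testing
        self.hits = defaultdict(deque)  # ip -> timestamps of recent 404s
        self.banned_until = {}          # ip -> time the ban expires

    def is_banned(self, ip):
        until = self.banned_until.get(ip)
        if until is None:
            return False
        if self.clock() >= until:
            # Ban has expired, so forget it.
            del self.banned_until[ip]
            return False
        return True

    def record(self, ip, status):
        if status != 404:
            return
        now = self.clock()
        hits = self.hits[ip]
        hits.append(now)
        # Drop 404s that have fallen out of the window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) > self.max_404s:
            self.banned_until[ip] = now + self.ban_time
            hits.clear()
```

Call `record(ip, status)` after each response and `is_banned(ip)` before serving the next request. Only 404 responses count, so normal traffic never accumulates hits.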
Rate limiting based on IP, blocking obvious datacenter ASNs, and blocking identifiable JA3 fingerprints is quite simple and surprisingly effective at stopping most scrapers, and it can be done entirely server side. I wouldn't be surprised if this catches more than half of the problematic requests to the average website. But I agree that if you have a website "worth" scraping, there will probably be some individuals motivated enough to bypass those restrictions.
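The datacenter check itself can be as small as a prefix lookup. A rough sketch, assuming you already have a list of CIDR ranges for the ASNs you care about: the ranges below are reserved documentation prefixes used as placeholders, not real datacenter ranges, and `is_datacenter` is a hypothetical helper name. What to do with a match (block, rate limit harder, show a challenge) is left to the caller.

```python
import ipaddress

# Placeholder prefixes (reserved documentation ranges), NOT real datacenter
# networks. In practice this list would be built from ASN-to-prefix data.
DATACENTER_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_datacenter(ip):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    # Membership tests across IPv4/IPv6 simply evaluate to False.
    return any(addr in net for net in DATACENTER_NETWORKS)
```

A linear scan is fine for a handful of prefixes; a large list would want a radix tree or a sorted-interval lookup instead.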
> blocking obvious datacenter ASNs
You block all VPN users then, and many countries currently have some kind of censorship, so please don't do that. I have used a personal VPN for over 5 years, and that's annoying.
I understand the other side, and captchas, PoW captchas, or additional checks are okay. But give people a choice to be private and uncensorable.
Disabling the VPN every minute to access an uncensored local site that blocks datacenter IPs, then enabling it again for general surfing, is a bit of a hell.