Comment by Szpadel
1 day ago
The worst offender I've seen is Meta.
They have the facebookexternalhit bot (which sometimes shows up with the default python-requests user agent) that, as they document, explicitly ignores robots.txt.
It's (as they say) used to check whether links contain malware. But if someone actually wanted to serve malware, the first thing they'd do is serve an innocent page to Facebook's AS and their user agent.
They also re-check every URL every month to verify it still doesn't contain malware.
The issue is this: bad actors spam Facebook with URLs to expensive endpoints (like a search with random filters), and Facebook effectively hands your competition a free DDoS service. They flood you with >10 r/s for days, every month.
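If you just want to cut this off at the application layer, here's a minimal sketch of a WSGI middleware that throttles anything identifying itself as facebookexternalhit. All names and thresholds here are made up, and it won't catch the hits that arrive with the plain python-requests user agent mentioned above; a rule in your reverse proxy or WAF is the more robust place for this.

    import time
    from collections import deque

    class BotThrottle:
        """Sketch: throttle requests whose User-Agent contains a given marker."""
        def __init__(self, app, marker="facebookexternalhit", limit=5, window=60):
            self.app = app          # the wrapped WSGI application
            self.marker = marker    # substring to look for in the User-Agent
            self.limit = limit      # requests allowed per window (assumed value)
            self.window = window    # window length in seconds (assumed value)
            self.hits = deque()     # timestamps of recent matching requests

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if self.marker in ua:
                now = time.monotonic()
                # drop timestamps that have fallen out of the window
                while self.hits and now - self.hits[0] > self.window:
                    self.hits.popleft()
                if len(self.hits) >= self.limit:
                    start_response("429 Too Many Requests",
                                   [("Content-Type", "text/plain"),
                                    ("Retry-After", str(self.window))])
                    return [b"rate limited\n"]
                self.hits.append(now)
            return self.app(environ, start_response)

    # usage: wrap your existing WSGI app, e.g. app = BotThrottle(app)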
Since when is 10 r/s flooding?
That barely registers as a blip even if you're hosting your site on a single server.
In our case this was a very heavy, specialized endpoint, and because each request used a different set of parameters it could not benefit from caching (in fact it thrashed the cache with useless entries).
This forced us to scale up. When handling that one bot costs more than serving the rest of the users and bots combined, that's an issue, especially for our customers with smaller traffic.
The request rate varied from site to site, but it ranged from half to 75% of total traffic and basically saturated many servers for days if not blocked.
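If you want to check what share of your own traffic this is, a quick sketch that counts matching lines in an access log; the log path and the assumption that the user agent appears verbatim in each line are mine, adjust for your setup.

    # Rough sketch: what fraction of requests in an access log come from
    # facebookexternalhit. Log path and combined log format are assumptions.
    total = 0
    bot = 0
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            total += 1
            if "facebookexternalhit" in line:
                bot += 1
    if total:
        print(f"{bot}/{total} requests ({100 * bot / total:.1f}%) from facebookexternalhit")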
That depends on what you're hosting. Good luck if it's, e.g., a web interface for a bunch of git repositories with a long history. You can't cache effectively because there are too many pages and generating each page isn't cheap.
If you're serving static pages through nginx or something, then 10 r/s is nothing. But if you're running Python code to generate every page, it adds up fast.
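Rough back-of-envelope, with an assumed per-page cost (not a measured number), just to show how fast it adds up:

    # Back-of-envelope: CPU needed to serve 10 r/s of uncacheable dynamic pages.
    cpu_seconds_per_request = 0.5   # assumed: heavy render, e.g. a git history view
    requests_per_second = 10
    cores_busy = cpu_seconds_per_request * requests_per_second
    print(f"~{cores_busy:.0f} CPU cores busy just serving the bot")   # ~5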