That's right, getting DDOSed is a skill issue. Just have infinite capacity.
DDOS is different from crashing.
And I doubt Facebook implemented something that actually saturates the network; usually a scraper implements a limit on concurrent connections and often also a delay between connections (e.g. max 10 concurrent, 100ms delay).
Chances are the website operator implemented a webserver with terrible RAM efficiency that runs out of RAM and crashes after 10 concurrent requests, or that saturates the CPU from simple requests, or something like that.
You can doubt all you want, but none of us really know, so maybe you could consider interpreting people's posts a bit more generously in 2025.
I've seen concurrency in excess of 500 from Meta's crawlers to a single site. That site had just moved all their images, so all the requests hit the "pretty URL" rewrite into a slow dynamic request handler. It did not go very well.
Can't every webserver crash due to being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.
Perhaps I'm saying crash and you're interpreting that as a bug, but really it's just an OOM issue because of too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.
I suspect if the tables were turned and someone managed to crash FB consistently they might not take too kindly to that.
I wouldn't expect it to crash in any case, but I'd generally expect that even an N100 mini PC should bottleneck on the network long before you manage to saturate CPU/RAM (maybe if you had 10Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can do thousands to tens of thousands of requests per second.
You completely ignore the fact that they are also requesting a lot of pages that can be expensive to retrieve/calculate.
The alternative of crawling to a stop isn’t really an improvement.
No normal person has a chance against the capacity of a company like Facebook.
Anyone can send 10k concurrent requests with no more than their mobile phone.
Yeah, this is the sort of thing that a caching and rate-limiting load balancer (e.g. nginx) could very trivially mitigate. Just add a request-limit bucket keyed on the Meta User-Agent allowing at most 1 qps or whatever (tune it to ~20% of your backend capacity), returning 429 when exceeded.
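Something like this is the rough shape of it in nginx (just a sketch: the UA patterns, zone name, and "backend" upstream are placeholders, and the numbers are there to be tuned):

    # http{} context
    map $http_user_agent $meta_bot_key {
        default                "";      # empty key = request is not rate limited
        ~*facebookexternalhit  "meta";  # one shared bucket for all Meta bot traffic
        ~*meta-externalagent   "meta";
    }

    limit_req_zone $meta_bot_key zone=meta_bots:1m rate=1r/s;

    server {
        location / {
            limit_req zone=meta_bots burst=5 nodelay;
            limit_req_status 429;        # default would be 503
            proxy_pass http://backend;   # whatever your app upstream is
        }
    }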
Of course Cloudflare can do all of this for you, and they functionally have unlimited capacity.
Read the article, the bots change their User Agent to an innocuous one when they start being blocked.
And having to use Cloudflare is just as bad for the internet as a whole as having bots routinely eat up all available resources.
I did read the article. I'm skeptical of the claim, though. The author was careful to publish specific UAs for the bots, but then provided no extra information about the non-bot UAs.
>If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
I'm also skeptical of the need for _anyone_ to access the edit history at 10 qps. You could put an nginx rule on those routes that just limits the edit history pages to 0.5 qps per IP and 2 qps across all IPs, which would protect your site from both bad AI bots and dumb MediaWiki script kiddies with little impact.
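Roughly like this, assuming MediaWiki-style URLs where history pages hang off /index.php with ?action=history (adjust the match to the actual URL layout; zone names and burst values are arbitrary):

    # http{} context
    map $arg_action $hist_ip_key {
        default    "";
        ~^history  $binary_remote_addr;  # per-IP key, only set on history pages
    }
    map $arg_action $hist_all_key {
        default    "";
        ~^history  "all";                # one shared key across every client
    }

    limit_req_zone $hist_ip_key  zone=hist_per_ip:10m rate=30r/m;  # ~0.5 qps per IP
    limit_req_zone $hist_all_key zone=hist_total:1m   rate=2r/s;   # 2 qps overall

    server {
        location /index.php {
            limit_req zone=hist_per_ip burst=3 nodelay;
            limit_req zone=hist_total  burst=3 nodelay;
            limit_req_status 429;
            # ... hand off to PHP-FPM / the backend as usual
        }
    }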
>Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.
And caching would fix this too, especially for pages that are guaranteed not to change (e.g. an edit history diff page).
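For example, something along these lines with nginx's proxy cache (again just a sketch: the cache path, zone name, cookie name, and "backend" upstream are placeholders, and in a real MediaWiki setup you'd be careful not to cache logged-in traffic):

    # http{} context
    proxy_cache_path /var/cache/nginx/wiki keys_zone=wiki_cache:10m max_size=1g inactive=24h;

    server {
        location /index.php {
            proxy_cache           wiki_cache;
            proxy_cache_key       "$scheme$host$request_uri";
            proxy_cache_valid     200 24h;          # diffs/old revisions never change anyway
            proxy_cache_bypass    $cookie_session;  # skip the cache for logged-in users...
            proxy_no_cache        $cookie_session;  # ...and don't store their responses
            # may be needed if the app marks everything uncacheable; use with care
            proxy_ignore_headers  Cache-Control Expires;
            # let stale copies absorb the load when the backend is struggling
            proxy_cache_use_stale error timeout updating http_500 http_502 http_503;
            proxy_pass http://backend;
        }
    }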
Don't get me wrong, I'm not unsympathetic to the author's plight, but I do think that the internet is an unsafe place full of bad actors, and a single bad actor can easily cause a lot of harm. I don't think throwing up your hands and complaining is that helpful. Instead, just apply the mitigations that have existed for this for at least 15 years, and move on with your life. Your visitors will be happier and the bots will get boned.