Comment by thresh
20 hours ago
You can't really cache the dynamic content produced by forges like GitLab or, say, web forums like phpBB, so every request goes through the slow path. Media/JS is of course cached on the edge, so that's not an issue.
Even when the amount of AI requests isn't that high - generally it's in the hundreds per second, tops, for our services combined - it's still a load that causes issues for legitimate users/developers. We've seen it grow from somewhat reasonable to pretty much 99% of the responses we serve.
Can it be solved by throwing more hardware at the problem? Sure. But it's not sustainable, and the reasonable approach in our case is to filter off the parasitic traffic.
Thanks, appreciate the details. 99% is far above what I expected, and if it specifically hits hard-to-cache data then I can see how that brings a system to its knees.
You kind of can, though. You serve cached assets and then use JavaScript to modify them for the individual user. The specific user actions can't be cached, but the rest of it can.
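Rough sketch of what I mean (the /api/me endpoint and the element IDs are made up, not anything from GitLab or phpBB): the HTML shell comes straight out of the CDN cache, and a small script fills in the per-user bits afterwards.

    // Hypothetical client-side hydration of a fully cached page.
    // The HTML shell comes from the edge cache; only this small JSON call hits the origin.
    async function personalize(): Promise<void> {
      // /api/me is a made-up endpoint returning the per-user data that can't be cached
      const res = await fetch("/api/me", { credentials: "include" });
      if (!res.ok) return; // anonymous visitor: leave the cached shell untouched

      const user: { name: string; unread: number } = await res.json();
      document.querySelector("#username")!.textContent = user.name;
      document.querySelector("#inbox-badge")!.textContent = String(user.unread);
    }

    document.addEventListener("DOMContentLoaded", () => { void personalize(); });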
Totally. Remember that Slashdot in the 1990s served a dynamic page from a handful of servers whose combined horsepower is dwarfed by a Nintendo Switch, and had a user base capable of bringing major properties down.
The "can't" comes from the fact that VLC is not going to rewrite their forum software or software forge.
Software written in PHP is in most cases frankly still abysmally slow and inefficient. WordPress runs like 70% of the web and you can really feel it in the 1500ms+ TTFB most sites have. phpBB is not much better. Pathetic throughput at best, and it hasn't gotten better in decades.
I don't know how GitLab became so disgustingly slow. But yeah, I'm not surprised bots can easily bring it to its knees.
> WordPress runs like 70% of the web and you can really feel it in the 1500ms+ TTFB most sites have. phpBB is not much better.
At least phpBB died 15 years ago, with most communities migrating to XenForo. I'm not quite sure how or why WP is still around with so many SSGs and SaaS site builders floating around these days.
The funniest part about WordPress is that you can usually get a 50% or better speed boost just by adding a plugin that minifies and caches the ridiculous number of dynamic CSS and JS files most themes and plugins add to every page. Set those up with HTTP 103 Early Hints preload headers (so the browser can start sending subresource requests in the background before the HTML is even sent out - exactly the kind of thing HTTP/2 and /3 were designed to make possible), then throw Cloudflare or another decent CDN on top, and you're suddenly getting TTFBs much closer to a more "modern" stack.
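If anyone's curious what that looks like mechanically, here's a rough sketch using Node's built-in writeEarlyHints (the asset paths are placeholders for whatever the theme/plugins actually emit; on a PHP origin you'd send the equivalent Link headers and let the CDN turn them into 103s):

    import { createServer } from "node:http";

    // Sketch only: asset paths are placeholders, not real files.
    const server = createServer((req, res) => {
      // The 103 Early Hints response goes out immediately, so the browser can start
      // fetching subresources while the backend is still rendering the page.
      res.writeEarlyHints({
        link: [
          "</assets/site.min.css>; rel=preload; as=style",
          "</assets/site.min.js>; rel=preload; as=script",
        ],
      });

      // ...slow dynamic rendering happens here...
      res.writeHead(200, { "content-type": "text/html" });
      res.end("<html><!-- rendered page --></html>");
    });

    server.listen(8080);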
The bizarre thing is that pretty much no CMS, even the "new" ones, seems to automate all of that by default. None of those steps is that difficult to implement, they provide a serious speed boost to everything from WordPress to MediaWiki in my experience, and yet the only service that comes close to offering it is Cloudflare.
Even then, Cloudflare's tooling only works at its best if you're already emitting minified, compressed files and custom-written preload headers on the origin side, since decompressing all the origin traffic to make those adjustments and analyses is far worse for performance than just forwarding your compressed responses directly. That's why they removed Auto Minify[1] and encourage sending pre-compressed Brotli level 11 responses from the origin[2], so people on recent browsers get pass-through compression without extra cycles being spent on Cloudflare's servers.
The solution seems pretty clear: aim to get as much stuff served statically, preferably pre-compressed, as you can. But it's weird that actually implementing that is still a manual process on most CMSes, when it shouldn't be that hard to make it a standard feature.
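As a sketch of the pre-compressed idea (paths and file names made up): compress each asset once at build/deploy time with brotli -q 11, then have the server hand back the .br file whenever the client advertises support, so nothing gets compressed per request.

    import { createReadStream, existsSync } from "node:fs";
    import { createServer } from "node:http";
    import { join } from "node:path";

    // Sketch: ./public/*.br was produced once at deploy time with `brotli -q 11`,
    // so nothing is compressed per request - the bytes on disk go out as-is.
    const PUBLIC_DIR = "./public"; // hypothetical asset directory

    createServer((req, res) => {
      const assetPath = join(PUBLIC_DIR, req.url ?? "/");
      const acceptsBr = (req.headers["accept-encoding"] ?? "").includes("br");

      if (acceptsBr && existsSync(assetPath + ".br")) {
        res.writeHead(200, { "content-encoding": "br", vary: "Accept-Encoding" });
        createReadStream(assetPath + ".br").pipe(res);
      } else if (existsSync(assetPath)) {
        res.writeHead(200);
        createReadStream(assetPath).pipe(res);
      } else {
        res.writeHead(404).end();
      }
    }).listen(8080);

(Content-Type handling and path sanitization omitted for brevity; in a real setup you'd let the web server or CDN do this.)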
And as for Git web interfaces, the correct solution is to require logins to view complete history. Nobody likes saying it, nobody likes hearing it. But Git is not efficient enough on its own to handle the constant bombardment of random history paginations and diffs that AI crawlers seem to love. It wasn't an issue before, because old crawlers for things like search engines were smart enough to ignore those types of pages, or at least to accept it when the sysadmin said they should be ignored. AI crawlers have no limits, ignore signals from site operators, make no attempt to skip redundant content, and are in general very dumb about how they send requests. This is a large part of why Anubis works so well: it's not a particularly complex or hard-to-bypass proof-of-work system[3], but AI bots genuinely don't care about anything except consuming as many HTTP 200s as a server can return, and they give up at the slightest hint of pushback (though they do at least try randomizing IPs and User-Agents, since those are effectively zero-cost to attempt).
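For anyone who hasn't looked at how these challenges work, the core really is that simple. This is a toy illustration, not Anubis's actual code: hash a server-issued challenge plus a nonce until the digest has enough leading zeroes.

    import { createHash } from "node:crypto";

    // Toy Anubis-style proof of work, not Anubis's actual implementation.
    // The server issues a random challenge; the client must find a nonce such that
    // sha256(challenge + nonce) starts with `difficulty` zero hex digits.
    function verify(challenge: string, nonce: string, difficulty: number): boolean {
      const digest = createHash("sha256").update(challenge + nonce).digest("hex");
      return digest.startsWith("0".repeat(difficulty));
    }

    // What the client-side JS does: brute-force nonces until one passes.
    function solve(challenge: string, difficulty: number): string {
      for (let nonce = 0; ; nonce++) {
        if (verify(challenge, String(nonce), difficulty)) return String(nonce);
      }
    }

    // Difficulty 4 is ~65k hashes on average: trivial for a real browser,
    // but enough friction that most scrapers just give up.
    console.log(verify("server-issued-token", solve("server-issued-token", 4), 4)); // true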
[1]: https://www.joelonsoftware.com/2000/05/12/strategy-letter-i-... - still probably the single greatest blog post ever written, 26 years later.