Comment by thresh
20 hours ago
We had to set it up on the parts of VideoLAN infra so the service would remain usable.
Otherwise it was under a constant DDoS by the AI bots.
Maybe I’m naive about this, but I didn’t expect AI scrapers to be that big of a load? I mean, it’s not like they need to scrape the same site at 1000+ QPS, and even then I wouldn’t expect them to download all the media and images either?
What am I missing that explains the gap between this and “constant DDoS” of the site?
You've gotten several comprehensive responses so far, and I want to add a niche corner that people might assume doesn't have the bot problem but still does.
I run a website that hosts tools for my family: games and a TV interface for the kids, remote access to our family cloud and cameras, etc. Sensitive things require a login and additional parameters for access, of course.
I specifically blocked search-engine bots so my site is never indexed, since I'm not selling anything and don't want any attention. I also blocked some other public, non-malicious bots in case they feed data back to Google, just to be safe, and my robots.txt disallows everything.
I assume, then, that the only way a bot could even find my site is to do what the indexers do: brute-force every possible IPv4 address and hope to hear something back, since my domain shouldn't be known (and isn't simple enough to be quickly guessed). So most of the traffic must be malicious or indexing bots (AI Overviews and other scrapers won't be finding it via web search).
Since it isn't indexed, and keeping things in simple black-and-white terms, my remaining traffic is either family or malicious bots, and 99.9% of it isn't family.
I currently have the strictest bot-blocking setup I could come up with, which cut traffic down considerably, but I still receive ~2k attempts per day. As you can imagine, that's still around 99% non-family traffic, as I have fewer than 20 kids, and my kids aren't using the site nonstop.
Conveniently, my setup has never accidentally blocked a family member, so I'm pleased with the setup.
You can't really cache the dynamic content produced by forges like GitLab or, say, web forums like phpBB, so every request goes through the slow path. Media/JS is of course cached at the edge, so that's not an issue.
Even though the volume of AI requests isn't that high - generally hundreds per second at most for all our services combined - it's still a load that causes issues for legitimate users/developers. We've seen it grow from somewhat reasonable to pretty much 99% of the responses we serve.
Can it be solved by throwing more hardware at the problem? Sure. But it's not sustainable, and the reasonable approach in our case is to filter off the parasitic traffic.
Thanks, appreciate the details. 99% is far above the amount I expected, and if it specifically hits hard-to-cache data then I can see how that brings a system to its knees.
You kind of can, though. You serve cached assets and then use JavaScript to customize the page for the individual user. The user-specific parts can't be cached, but the rest can.
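A rough sketch of that split, assuming Flask and made-up endpoints (/ serves a cacheable shell, /api/me returns the per-user bits):

    # Sketch only: the HTML shell gets long-lived cache headers so a CDN/edge
    # cache can absorb crawler traffic, while a tiny uncacheable JSON endpoint
    # fills in the per-user parts from the browser.
    from flask import Flask, jsonify, make_response

    app = Flask(__name__)

    SHELL = """<!doctype html>
    <html><body>
      <div id="user">loading...</div>
      <script>
        // Runs in the visitor's browser: only this small request hits the slow path.
        fetch("/api/me").then(r => r.json()).then(d => {
          document.getElementById("user").textContent = d.name;
        });
      </script>
    </body></html>"""

    @app.get("/")
    def shell():
        resp = make_response(SHELL)
        # Cacheable at the edge: bots hammering "/" never reach the app server.
        resp.headers["Cache-Control"] = "public, max-age=3600"
        return resp

    @app.get("/api/me")
    def me():
        # Per-user data: explicitly uncacheable, but cheap compared to a full page render.
        resp = jsonify(name="example user")
        resp.headers["Cache-Control"] = "private, no-store"
        return resp

    if __name__ == "__main__":
        app.run()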
I think there are a few things at play here:
- AI scrapers will pull a bunch of docs from many sites in parallel (so instead of a human request where someone picks a single Google result, it hits a bunch of sites)
- AI will crawl the site looking for the correct answer which may hit a handful of pages
- AI sends requests in quick succession (big bursts instead of small trickle over longer time)
- Personal assistants may crawl the site repeatedly scraping everything (we saw a fair bit of this at work, they announced themselves with user agents)
- At work (b2b SaaS webapp) we also found that the personal assistant variety tended to hammer really computationally expensive data export and reporting endpoints generally without filters. While our app technically supported it, it was very inorganic traffic
That said, I don't think the solution is blanket blocks. Really it's exposing that sites are poorly optimized for emerging technology.
Also, relevant for forges: AI doesn't understand what it's clicking on. Git forges tend to e.g. have a lot of links like “download a tarball at this revision” which are super-expensive as far as resources go, and AI crawlers will click on those because they click on every link that looks shiny. (And there are a lot of revisions in a project like VLC!) Much, much more often than humans do.
They are a scourge: they never rate-limit themselves, there are a hundred of them, and a significant number don't respect robots.txt. Many of them also end up on our meta noindex,nofollow search pages, leading to cost overruns on our Algolia usage. We spend far more time adjusting our WAF and other bot controls than we should.
Yes, it's that BIG of a load: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
Thanks. I imagine there is (a) a lot of interest in scraping source code, and (b) a lot of requests to forges hitting expensive paths. 99% of volume though, wow, much more than expected.
While I do sympathize with the AI DDoS situation, it'd be nice if there were a solution that still allows them to pull official docs.
For instance: MCP, static sites that are easy to scale, or a cache in front of a dynamic site engine.
Of course, static websites are the best solution to that problem.
Our documentation and a main website are not fronted by this protection, so they're still accessible for the scrapers.
I highly doubt there is no other technically feasible option to block the AI bots. You end up blocking not just bots, but many humans too. When I clicked on the link and the bot block came up, I just clicked back. I think HN posts should have warnings when the site blocks you from seeing it until you somehow, maybe, prove you are human.
I'm sure there are many solutions for many problems, but expecting a small FOSS development team to know or implement them all is rather unreasonable.
I think the world gains more if the VideoLAN team focuses on their amazing, free contribution to the world than if they spend that time trying to figure out how to save you two clicks.
We all hate that this is happening, but you don't need to attack everyone that is unfortunately caught up in it.
> I highly doubt there is no other technically feasible option to block the AI bots.
If you have discovered such an option, you could get very wealthy: minimizing friction for humans in e-commerce is valuable. If you're a drive-by critic not vested in the project, then yours is an instance of talk being cheap.
I'm all ears on how we can fix it otherwise.
Keep in mind that those kinds of services:
- should not be MITMed by CDNs
- are generally run by volunteers with zero budget, money- and time-wise
First off, don't block the first connection of the day from a given IP. Rate limit/block from there, for example the way sshguard does it.
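For illustration, a minimal sliding-window limiter in that spirit (the thresholds below are made up, not recommendations):

    # Sketch: let the first/occasional requests from an IP through, and only
    # start blocking once the IP exceeds a per-window budget, sshguard-style.
    import time
    from collections import defaultdict, deque

    WINDOW = 60        # seconds of history to keep per IP
    MAX_HITS = 30      # requests allowed inside the window before a ban
    BAN_SECONDS = 600  # how long an offender stays blocked

    hits = defaultdict(deque)   # ip -> timestamps of recent requests
    banned_until = {}           # ip -> unix time when the ban expires

    def allow(ip, now=None):
        now = now if now is not None else time.time()
        if banned_until.get(ip, 0) > now:
            return False                      # still banned
        q = hits[ip]
        while q and q[0] < now - WINDOW:      # forget hits outside the window
            q.popleft()
        q.append(now)
        if len(q) > MAX_HITS:                 # over budget: ban instead of dropping one request
            banned_until[ip] = now + BAN_SECONDS
            return False
        return True                           # first and occasional requests always pass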
I've seen several posts on HN and elsewhere showing many bots can be fingerprinted and blocked based on HTTP headers and TLS.
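As a toy example of the header side (the TLS side, e.g. JA3-style fingerprints, needs support from whatever terminates TLS); the bot list below is partial and purely illustrative:

    # Sketch: many scripted clients either announce themselves in the
    # User-Agent, or claim to be a browser while omitting headers that real
    # browsers send on every navigation.
    KNOWN_BOT_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")  # partial list

    def looks_like_bot(headers):
        ua = headers.get("User-Agent", "")
        if any(token in ua for token in KNOWN_BOT_TOKENS):
            return True
        claims_browser = "Mozilla/" in ua
        # Real browsers normally send these; plenty of scrapers skip them.
        missing = not headers.get("Accept-Language") or not headers.get("Accept-Encoding")
        return claims_browser and missing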
For the bots that perfectly match the fingerprint of an interactive browser and don't trigger rate limits, use hidden links to tarpits and zip bombs. Many of these have been discussed on HN. Here's the first one that came to memory: https://news.ycombinator.com/item?id=42725147
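A tarpit can be as simple as an endpoint that drips data very slowly, linked only from somewhere humans never click and disallowed in robots.txt, so only crawlers that follow every link and ignore robots.txt reach it. Again just a sketch, assuming Flask and a made-up /trap route:

    # Sketch: the response streams one tiny chunk every few seconds forever,
    # tying up the crawler's connection instead of your backend's CPU.
    import time
    from flask import Flask, Response

    app = Flask(__name__)

    @app.get("/trap")
    def trap():
        def drip():
            while True:
                yield "<a href='/trap'>more</a>\n"  # keep the crawler chasing itself
                time.sleep(10)
        return Response(drip(), mimetype="text/html")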