Comment by danaris

1 month ago

> That's why if you create a bot to scrape, make it not take up more resources than a typical browser-based visitor.

Well, right; that's the problem.

They take up orders of magnitude more resources. They absolutely hammer the server. They don't care if your website even survives, so long as they get every single drop of data they can for training.

Source: my own personal experience with them taking down my tiny browser game (~125 unique weekly users—not something of broad general interest!) repeatedly until I locked its Wiki behind a login wall.

2 comments


expedition32 1 month ago

This is like email, where eventually 90% of it was spam and we all got spam filters.

danaris 1 month ago

Except that something effectively equivalent to spam filters will be utterly ineffective here.
Spam filters:
- mitigate the symptom (our inboxes being impossible to trawl through for real emails)
- reduce the incentive (because any spam email that isn't seen by a human being reduces the spammers' chances of profiting)
- but do not affect the resource consumption directly (because the email has already been sent through the internet)
Now, this last point barely matters with spam, because sending email requires nearly no resources.
With LLM-training scraper bots, on the other hand, the symptom is the resource consumption. By the time you see their traffic to try to filter it, it's already killing your server. The best you can hope to do is recognize their traffic after a few seconds of firehose and block the IP address.
Then they switch to another one. You block that. They switch to another one.
Residential IPs. Purchased botnet IPs. Constantly rotating IPs.
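
To make the detect-then-block loop above concrete, here is a minimal sketch of a per-IP burst blocker. Everything in it is an assumption for illustration: the `BurstBlocker` and `served_count` names are hypothetical, the threshold of 20 requests per 5 seconds is arbitrary, and the addresses come from the reserved documentation ranges. It is not how any particular server or firewall does it.

```python
from collections import defaultdict, deque


class BurstBlocker:
    """Hypothetical sketch: block an IP once it exceeds max_requests
    within a sliding window of window_seconds."""

    def __init__(self, max_requests=20, window_seconds=5.0):
        # Both thresholds are illustrative assumptions, not recommendations.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now):
        """Return True if the request should be served."""
        if ip in self.blocked:
            return False
        hits = self.recent[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            # The only evidence is behavior we have already served.
            self.blocked.add(ip)
            del self.recent[ip]
            return False
        return True


def served_count(blocker, requests):
    """Count how many (ip, timestamp) requests get through."""
    return sum(1 for ip, now in requests if blocker.allow(ip, now))


if __name__ == "__main__":
    # 1000 requests in one second from a single address.
    single = [("203.0.113.7", i * 0.001) for i in range(1000)]
    print(served_count(BurstBlocker(), single))  # 20

    # The same 1000 requests spread over 50 rotating addresses.
    rotating = [(f"198.51.100.{i // 20}", i * 0.001) for i in range(1000)]
    print(served_count(BurstBlocker(), rotating))  # 1000
```

With a single address, the blocker cuts the flood off after 20 requests. Spread the same load over 50 addresses and every request stays under the per-IP threshold, so all 1000 are served and nothing is ever blocked.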
Unlike spam, there's no reliable way to block an LLM bot that you haven't seen yet, because the only thing that tells you it's a bot is its existing pattern of behavior. And the only unique identifier you can get for it is its IP address.
So how, exactly, are we supposed to filter them effectively, while also allowing legitimate users to access our sites? Especially small-time sites that don't make any money, and thus can't afford to buy CloudFlare or similar protection?