Comment by danaris

11 hours ago

Except that something effectively equivalent to spam filters will be utterly ineffective here.

Spam filters

- mitigate the symptom (our inboxes being impossible to trawl through for real emails)

- reduce the incentive (because any spam mail that isn't seen by a human being reduces the chances they'll profit from their spamming)

- but do not affect resource consumption directly (because by the time a filter sees it, the email has already traversed the network)

Now, this last point barely matters for spam, because sending an email consumes almost no resources.

With LLM-training scraper bots, on the other hand, the symptom is the resource consumption. By the time you see their traffic to try to filter it, it's already killing your server. The best you can hope to do is recognize their traffic after a few seconds of firehose and block the IP address.
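(To make that concrete, here's a minimal sketch in Python of the per-IP rate detection I'm describing. The names and thresholds are made up for illustration; a real setup would do this in fail2ban or firewall rules rather than application code.)

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10    # how far back the sliding window looks
MAX_REQUESTS = 100     # requests per window before we call it a firehose

recent = defaultdict(deque)  # ip -> timestamps of recent requests
blocked = set()

def should_block(ip, now=None):
    """Record one request from `ip`; return True once it exceeds the rate."""
    if ip in blocked:
        return True
    now = time.monotonic() if now is None else now
    q = recent[ip]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:  # expire old timestamps
        q.popleft()
    if len(q) > MAX_REQUESTS:
        blocked.add(ip)
        return True
    return False
```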

Then they switch to another one. You block that. They switch to another one.

Residential IPs. Purchased botnet IPs. Constantly rotating IPs.
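(Continuing the sketch above with hypothetical rotation: a scraper that spreads the same load across a pool of addresses stays under the per-IP threshold indefinitely, so the limiter never fires.)

```python
# Hypothetical: 10,000 requests spread across a rotating pool of 254
# addresses (TEST-NET-3, reserved for documentation) is only ~40
# requests per IP -- well under MAX_REQUESTS, so nothing gets blocked.
for i in range(10_000):
    ip = f"203.0.113.{(i % 254) + 1}"
    assert not should_block(ip)
```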

Unlike spam, there's no reliable way to block an LLM bot that you haven't seen yet, because the only thing that tells you it's a bot is their existing pattern of behavior. And the only unique identifier you can get for them is their IP address.

So how, exactly, are we supposed to filter them effectively while still allowing legitimate users to access our sites? Especially small-time sites that don't make any money, and thus can't afford Cloudflare or similar protection?