
Comment by mentalgear

19 days ago

Noteworthy from the article (since some commenters suggested blocking them):

"If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet."

This is the beginning of the end of the public internet, imo. Websites that aren't able to manage the bandwidth consumption of AI scrapers and the endless spam that will take over from LLMs writing comments on forums are going to go under. The only things left after AI has its way will be walled gardens with whitelisted entrants or communities on large websites like Facebook. Niche, public sites are going to become unsustainable.

  • Classic spam all but killed small email hosts; AI spam will kill off the web.

    Super sad.

  • Yeah. Our research group has a wiki with (among other things) a list of open, completed, and ongoing bachelor's/master's theses. Until recently, the list was openly available, but AI bots caused significant load by crawling each page hundreds of times, following all links to tags (which are implemented as dynamic searches), prior revisions, etc. For the past few weeks, the pages have been available only to authenticated users.

I'd kind of like to see that claim substantiated a little more. Is it all crawlers that switch to a non-bot UA, or how are they determining it's the same bot? What non-bot UA do they claim?

  • > Is it all crawlers that switch to a non-bot UA

    I've observed only one of them do this with high confidence.

    > how are they determining it's the same bot?

    it's fairly easy to determine that it's the same bot, because as soon as I blocked the "official" one, a bunch of AWS IPs started crawling the same URL patterns - in this case, mediawiki's diff view (`/wiki/index.php?title=[page]&diff=[new-id]&oldid=[old-id]`), that absolutely no bot ever crawled before.

    > What non-bot UA do they claim?

    Latest Chrome on Windows.
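The fingerprinting described above (same unusual URL pattern, new IPs, browser UA) can be sketched as a simple log scan. This is a minimal illustration with hypothetical log lines, not the commenter's actual tooling; the regex matches the MediaWiki diff-view URLs they mention.

```python
import re
from collections import Counter

# Hypothetical access-log lines; in practice these would come from the
# web server's access log.
log_lines = [
    '1.2.3.4 "GET /wiki/index.php?title=Main_Page&diff=123&oldid=122 HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    '1.2.3.5 "GET /wiki/index.php?title=Main_Page&diff=124&oldid=123 HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    '5.6.7.8 "GET /wiki/index.php?title=Help HTTP/1.1" "Mozilla/5.0 (X11; Linux)"',
]

# The MediaWiki diff-view URL pattern from the comment above; ordinary
# crawlers essentially never request these.
diff_url = re.compile(r'/wiki/index\.php\?title=[^&\s"]+&diff=\d+&oldid=\d+')

hits = Counter()
for line in log_lines:
    if diff_url.search(line):
        ip = line.split()[0]  # first field is the client IP
        hits[ip] += 1

# IPs hammering diff URLs are likely the disguised crawler.
suspects = [ip for ip in hits]
```

The tell is not the UA string (which claims to be Chrome) but the fact that only the blocked bot ever requested diff views.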

  • Presumably they switch UA to Mozilla/something but tell on themselves by still using the same IP range or ASN. Unfortunately this has become common practice for feed readers as well.
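That heuristic (switched UA, same IP range) can be sketched in a few lines. This is an illustrative sketch with made-up request data; it groups by /16 prefix as a rough stand-in for "same range", whereas a real setup would map IPs to ASNs via a routing database.

```python
import ipaddress
from collections import defaultdict

# Hypothetical (ip, user_agent) request pairs.
requests = [
    ("3.120.0.10", "ExampleBot/1.0"),
    ("3.120.0.11", "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"),
    ("3.120.0.12", "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"),
    ("203.0.113.5", "Mozilla/5.0 (X11; Linux)"),
]

# Group user agents seen per /16 network (a crude proxy for ASN).
by_range = defaultdict(set)
for ip, ua in requests:
    net = ipaddress.ip_network(f"{ip}/16", strict=False)
    by_range[net].add(ua)

# A range that presents both a bot UA and a browser UA is telling on itself.
suspicious = [str(net) for net, uas in by_range.items()
              if len(uas) > 1 and any("bot" in ua.lower() for ua in uas)]
```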

  • I would take anything the author said with a grain of salt. They straight up lied about the configuration of the robots.txt file.

    https://news.ycombinator.com/item?id=42551628

    • How do you know what the contextual configuration of their robots.txt is/was?

      Your accusation was directly addressed by the author in a comment on the original post, IIRC

      i find your attitude as expressed here to be problematic in many ways


I instituted `user-agent`-based rate limiting for exactly this reason, in exactly this case.

These bots were crushing our search infrastructure (which is tightly coupled to our front end).
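For concreteness, user-agent-based rate limiting can be sketched as a token bucket keyed by the UA string. This is a hypothetical in-process sketch, not the commenter's setup; production deployments would more typically do this in the web server or a reverse proxy (e.g. nginx's `limit_req`).

```python
import time


class UARateLimiter:
    """Token-bucket rate limiter keyed by User-Agent string (illustrative)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # maximum bucket size
        self.buckets = {}             # ua -> (tokens, last_timestamp)

    def allow(self, user_agent, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(user_agent, (float(self.burst), now))
        # Refill for elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[user_agent] = (tokens, now)
        return allowed


limiter = UARateLimiter(rate_per_sec=1.0, burst=2)
```

Each UA gets its own bucket, so a misbehaving bot UA is throttled without affecting ordinary browsers; the weakness, as noted upthread, is that bots can simply switch to a browser UA string.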