Comment by weinzierl
8 days ago
I think the answer for the non-commercial web is to stop worrying.
I understand why certain business models have a problem with AI crawlers, but I fail to see why sites like Codeberg have an issue.
If the problem is the cost of the traffic, that is nothing new, and I thought we had learned how to handle it by now.
The issue is the insane amount of traffic from crawlers that effectively DDoS websites.
For example: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
> [...] Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
The Linux kernel community has also been dealing with it, AFAIK. Apparently it's not so easy to handle, because these AI scrapers pull a lot of tricks to anonymize themselves.
One solution is to not expose expensive endpoints in the first place. Serve everything statically, or use heavy caching.
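To make "serve everything statically, or use heavy caching" concrete, here is a minimal sketch of the idea, assuming a pre-rendered site; the port, path, and max-age are illustrative, not from the thread. The point is that a long-lived Cache-Control header lets a CDN or reverse proxy absorb repeat crawler hits instead of the application.

```python
# Minimal sketch: serve pre-rendered pages with a long-lived
# Cache-Control header so an upstream cache (CDN, reverse proxy)
# answers repeat crawler requests instead of the app.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CachedStaticHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Allow any shared cache to keep the response for a day
        # (86400 s is an illustrative value, not a recommendation).
        self.send_header("Cache-Control", "public, max-age=86400")
        super().end_headers()

if __name__ == "__main__":
    # Serves the current directory of pre-generated HTML on port 8000.
    HTTPServer(("", 8000), CachedStaticHandler).serve_forever()
```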
> Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.
Source: https://wiki.archiveteam.org/index.php/Robots.txt
Several years ago, GitHub started moving features like code search on public repos behind login, likely because of issues like this, so it could better enforce rate limits. And that was before the era of LLMs going wild.
(And it led to outrage from people for whom requiring an account was some kind of insult.)
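For illustration, the reason login helps is that requests then carry an account you can rate-limit. A minimal sketch of a per-user token bucket follows; the limits and function name are assumptions for the example, not anything GitHub documents.

```python
# Sketch of per-account rate limiting via a token bucket.
# RATE and BURST are illustrative values only.
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second
BURST = 10.0  # maximum bucket size (burst allowance)

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(user_id: str) -> bool:
    """Return True if this authenticated user may make another request now."""
    b = _buckets[user_id]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last request, capped at BURST.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False
```

Anonymous traffic offers no such key to bucket on, which is why the scrapers' rotating residential IPs described above are so hard to throttle.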
About three hours ago, the Codeberg website was really slow.
Services like Codeberg that run on donations can easily be DoSed by AI crawlers.
One of my semi-personal websites gets crawled heavily by AI crawlers now. I use Bunny.net as a CDN; $20 used to last me for months of traffic, but now it lasts a week or two at most. It's enough that I'm going to go back to not using a CDN and just let the site suffer some slowness every once in a while.