Comment by rpcope1

8 days ago

I'm calling it now: this is the beginning of all of the remaining non-commercial properties on the web either going away or getting hidden inside of some trusted overlay network. Unless the "AI" race slows down or changes, or some other act of god happens, the incentives are aligned such that I foresee wide swaths of the net getting flogged to death.

Also, increasing balkanization of the internet. I now routinely run into sites that geoblock my whole country. This wasn't something I would see more than once or twice a year, and usually only with sites like Walmart that don't care about clients from outside the US.

Now it's 2-5 sites per day, including web forums and such.

  • If you live in Europe it probably has more to do with over-regulation than anything AI-related.

    • More like under-compliance than over-regulation.

      "Bruh sorry we were technically unable to produce a website without invasive dark pattern tracking stuff. Tech is haaaaard."

      Honestly, I've never found a page outside my own country that I couldn't live without. Screw that s*t.

I self-host a few servers and have not seen significant traffic increases from crawlers, so I can't agree with that without seeing some evidence of this issue's scale and scope. As far as I know it mostly affects commercial content aggregators.

  • It affects many open source projects as well; they just scrape everything, repeatedly and with abandon.

    First from known networks, then from residential IPs. First with dumb HTTP clients, now with full-blown headless Chrome browsers.

    • Well I can parse my nginx logs and don't see that happening, so I'm not convinced. I suppose my websites aren't the most discoverable, but the number of bogus connections sshd rejects is an order of magnitude or three higher than the number of unknown connections I get to my web server. Today I received requests from two whole clients from US data centers, so scrapers must be far more selective than you claim, or they are nowhere near the indie web killer OP purports them to be.

      I've worked with a company that has had to invest in scraper-traffic mitigation, so I'm not disputing that it happens in high enough volume to be problematic for content aggregators, but as for small independent non-commercial websites I'll stick with my original hypothesis unless I come across contradictory evidence (the kind of log tally sketched below is what I'd count as evidence).
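
      A minimal sketch of that kind of tally, assuming the stock nginx "combined" log format; the log path and the top-10 cutoff are placeholders, and it won't catch crawlers that spoof browser user agents, but it gives a rough picture:

          import re
          from collections import Counter

          LOG_PATH = "/var/log/nginx/access.log"  # placeholder path; adjust to your setup

          # Stock "combined" format: ip - user [time] "request" status bytes "referer" "user-agent"
          LINE_RE = re.compile(
              r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
          )

          ips, agents = Counter(), Counter()
          with open(LOG_PATH, errors="replace") as log:
              for line in log:
                  m = LINE_RE.match(line)
                  if m:
                      ips[m.group("ip")] += 1
                      agents[m.group("ua")] += 1

          print("unique client IPs:", len(ips))
          for ua, count in agents.most_common(10):  # ten busiest user agents
              print(f"{count:8d}  {ua[:80]}")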

Hasn't that been the case for a while? I'd imagine the combined traffic to all sites on the web doesn't match a single hour of traffic to the top 5 social media sites. The web has been pretty much dead for a while now; many companies don't even bother maintaining websites anymore.

I think the answer for the non-commercial web is to stop worrying.

I understand why certain business models have a problem with AI crawlers, but I fail to see why sites like Codeberg have an issue.

If the problem is the cost of the traffic, then this is nothing new, and I thought we had learned how to handle that by now.

  • The issue is the insane amount of traffic from crawlers that DDoS websites.

    For example: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

    > [...] Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

    The Linux kernel project has also been dealing with it, AFAIK. Apparently it's not so easy to deal with, because these AI scrapers pull a lot of tricks to anonymize themselves.

    • One solution is to not expose expensive endpoints in the first place: serve everything statically, or use heavy caching (rough sketch of the pre-rendering approach after the quote below).

      > Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.

      Source: https://wiki.archiveteam.org/index.php/Robots.txt

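      A rough sketch of that pre-rendering idea: generate the expensive pages once (from a cron job, say) and let the web server hand out flat files, so crawlers never reach the application. The repo path and output directory are made up for illustration, and `git log` stands in for any expensive endpoint:

          import html
          import pathlib
          import subprocess

          REPOS = ["/srv/git/myproject.git"]         # hypothetical repo location
          OUT = pathlib.Path("/var/www/static-log")  # directory the web server serves as static files

          for repo in REPOS:
              # Render the expensive view once, offline, instead of per request.
              log_text = subprocess.run(
                  ["git", "--git-dir", repo, "log", "--oneline", "-n", "200"],
                  capture_output=True, text=True, check=True,
              ).stdout
              # Write it out as a flat HTML file; regenerating on a schedule keeps it fresh enough.
              out_file = OUT / (pathlib.Path(repo).stem + ".html")
              out_file.parent.mkdir(parents=True, exist_ok=True)
              out_file.write_text(f"<pre>{html.escape(log_text)}</pre>")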

  • About 3 hours ago, the Codeberg website was really slow.

    Services like Codeberg that run on donations can easily be DoS'ed by AI crawlers.

  • One of my semi-personal websites gets crawled by AI crawlers a ton now. I use Bunny.net as a CDN; $20 used to last me for months of traffic, but now it only lasts a week or two at most. It's enough that I'm going to go back to not using a CDN and just let the site suffer some slowness every once in a while.

I could see it being the end of commercial and institutional web applications which cannot handle the traffic. But actual websites, which are just HTML and files in folders served by web servers, don't have problems with this.

Could it be a 'correct' continuation of Darwin's survival of the fittest?