Comment by rpcope1

8 days ago

I'm calling it now: this is the beginning of all of the remaining non-commercial properties on the web either going away or getting hidden inside of some trusted overlay network. Unless the "AI" race slows down or changes, or some other act of god happens, the incentives are aligned such that I foresee wide swaths of the net getting flogged to death.

Also, increasing balkanization of the internet. I now routinely run into sites that geoblock my whole country. This wasn't something I would see more than once or twice a year, and usually only with sites like Walmart that don't care about clients from outside the US.

Now it's 2-5 sites per day, including web forums and such.

  • If you live in Europe it probably has more to do with over-regulation than anything AI-related.

    • More like under-compliance than over-regulation.

      "Bruh sorry we were technically unable to produce a website without invasive dark pattern tracking stuff. Tech is haaaaard."

      Honestly, I've never found a page outside my own country that I couldn't live without. Screw that s*t.

I self-host a few servers and have not seen significant traffic increases from crawlers, so I can't agree with that without seeing some evidence of this issue's scale and scope. As far as I know it mostly affects commercial content aggregators.

  • It affects many open source projects as well; they just scrape everything, repeatedly and with abandon.

    First from known networks, then from residential IPs. First with dumb HTTP clients, now with full-blown headless Chrome browsers.

    • Well I can parse my nginx logs and don't see that happening, so I'm not convinced. I suppose my websites aren't the most discoverable, but the number of bogus connections sshd rejects is an order of magnitude or three higher than the number of unknown connections I get to my web server. Today I received requests from two whole clients from US data centers, so scrapers must be far more selective than you claim, or they are nowhere near the indie web killer OP purports them to be.

      I've worked with a company that has had to invest in scraper-traffic mitigation, so I'm not disputing that it happens in high enough volume to be problematic for content aggregators, but as for small independent non-commercial websites I'll stick with my original hypothesis unless I come across contradictory evidence (the kind of log tally sketched below is what I'd count as evidence).
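
      A minimal sketch of that kind of tally, assuming the stock nginx "combined" log format; the log path and the top-10 cutoff are placeholders, and it won't catch crawlers that spoof browser user agents, but it gives a rough picture:

          import re
          from collections import Counter

          LOG_PATH = "/var/log/nginx/access.log"  # placeholder path; adjust to your setup

          # Stock "combined" format: ip - user [time] "request" status bytes "referer" "user-agent"
          LINE_RE = re.compile(
              r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
          )

          ips, agents = Counter(), Counter()
          with open(LOG_PATH, errors="replace") as log:
              for line in log:
                  m = LINE_RE.match(line)
                  if m:
                      ips[m.group("ip")] += 1
                      agents[m.group("ua")] += 1

          print("unique client IPs:", len(ips))
          for ua, count in agents.most_common(10):  # ten busiest user agents
              print(f"{count:8d}  {ua[:80]}")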

Hasn't that been the case for a while? I'd imagine the combined traffic to all sites on the web doesn't match a single hour of traffic to the top 5 social media sites. The web has been pretty much dead for a while now; many companies don't even bother maintaining websites anymore.

I think the answer for the non-commercial web is to stop worrying.

I understand why certain business models have a problem with AI crawlers, but I fail to see why sites like Codeberg have an issue.

If the problem is the cost of the traffic, then this is nothing new, and I thought we had learned how to handle that by now.

  • The issue is the insane amount of traffic from crawlers that DDoS websites.

    For example: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

    > [...] Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

    The Linux kernel project has also been dealing with it, AFAIK. Apparently it's not so easy to deal with, because these AI scrapers pull a lot of tricks to anonymize themselves.

    • One solution is to not expose expensive endpoints in the first place: serve everything statically, or use heavy caching (rough sketch of the pre-rendering approach after the quote below).

      > Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.

      Source: https://wiki.archiveteam.org/index.php/Robots.txt

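      A rough sketch of that pre-rendering idea: generate the expensive pages once (from a cron job, say) and let the web server hand out flat files, so crawlers never reach the application. The repo path and output directory are made up for illustration, and `git log` stands in for any expensive endpoint:

          import html
          import pathlib
          import subprocess

          REPOS = ["/srv/git/myproject.git"]         # hypothetical repo location
          OUT = pathlib.Path("/var/www/static-log")  # directory the web server serves as static files

          for repo in REPOS:
              # Render the expensive view once, offline, instead of per request.
              log_text = subprocess.run(
                  ["git", "--git-dir", repo, "log", "--oneline", "-n", "200"],
                  capture_output=True, text=True, check=True,
              ).stdout
              # Write it out as a flat HTML file; regenerating on a schedule keeps it fresh enough.
              out_file = OUT / (pathlib.Path(repo).stem + ".html")
              out_file.parent.mkdir(parents=True, exist_ok=True)
              out_file.write_text(f"<pre>{html.escape(log_text)}</pre>")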

  • About 3 hours ago, the Codeberg website was really slow.

    Services like Codeberg that run on donations can easily be DoS'ed by AI crawlers.

  • One of my semi-personal websites gets crawled by AI crawlers a ton now. I use Bunny.net as a CDN; $20 used to last me for months of traffic, but now it only lasts a week or two at most. It's enough that I'm going to go back to not using a CDN and just let the site suffer some slowness every once in a while.

I could see it being the end of commercial and institutional web applications which cannot handle the traffic. But actual websites, which are just HTML and files in folders served by web servers, don't have problems with this.

Could it be a 'correct' continuation of Darwin's survival of the fittest?