
Comment by CqtGLRGcukpy

9 days ago

The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

This is from my experience having a personal website. AI companies keep coming back even if everything is the same.

Weird, considering IA has most of its content in a form you could rehost. I don't know why nobody’s just hosting an IA carbon copy that AI companies can hit endlessly, and cutting IA a nice little check in the process, but I guess some of the wealthiest AI startups are very frugal about training data?

This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.
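
To make that concrete, one well-known way to avoid refetching unchanged pages is HTTP conditional requests: the crawler keeps the `ETag` and `Last-Modified` validators from its last fetch and sends them back as `If-None-Match` and `If-Modified-Since`, so an unchanged page costs a bodiless `304 Not Modified` instead of a full download. A minimal sketch in Python, not a claim about how any real crawler works; `CachedPage`, `conditional_headers`, and `apply_response` are hypothetical names, and the actual HTTP call is left out:

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CachedPage:
    """What a crawler remembers about a URL from its previous fetch."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedPage]) -> dict:
    """Build request headers that let the server reply 304 if nothing changed."""
    headers = {}
    if cached is None:
        return headers  # first visit: plain unconditional GET
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def apply_response(
    cached: Optional[CachedPage],
    status: int,
    headers: Mapping[str, str],
    body: bytes,
) -> Optional[CachedPage]:
    """Decide which copy to keep after a fetch.

    Assumes `headers` uses canonical names ("ETag", "Last-Modified").
    """
    if status == 304 and cached is not None:
        return cached  # unchanged: reuse the stored copy, nothing downloaded
    if status == 200:
        return CachedPage(
            body=body,
            etag=headers.get("ETag"),
            last_modified=headers.get("Last-Modified"),
        )
    return cached  # errors etc.: keep whatever we already had
```

This only helps when the server emits validators, but static file servers and CDNs commonly do, and the saving is on both sides of the connection.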

  • That already exists, it's called Common Crawl[1], and it's a huge reason why none of this happened prior to LLMs coming on the scene, back when people were crawling data for specialized search engines or academic research purposes.

    The problem is that AI companies have decided that they want instant access to all data on Earth the moment that it becomes available somewhere, and have the infrastructure behind them to actually try and make that happen. So they're ignoring signals like robots.txt, and they aren't even checking whether the data is actually useful to them, as even the most aggressive search engine crawlers did. (They're not getting anything helpful out of recrawling the same search results pagination in every possible permutation, but that won't stop them from trying, and knocking everyone's web servers offline in the process.) Instead they are just bombarding every single publicly reachable server with requests on the off chance that some new data fragment becomes available and they can ingest it first.

    This is also, coincidentally, why Anubis is working so well. Anubis kind of sucks, and in a sane world where these companies had real engineers working on the problem, they could bypass it on every website in just a few hours by precomputing tokens.[2] But...they're not. Anubis is actually working quite well at protecting the sites it's deployed on despite its relative simplicity.

    It really does seem to indicate that LLM companies want to just throw endless hardware at literally any problem they encounter and brute force their way past it. They really aren't dedicating real engineering resources towards any of this stuff, because if they were, they'd be coming up with way better solutions. (Another classic example is Claude Code apparently using React to render a terminal interface. That's like using the space shuttle for a grocery run: utterly unnecessary, and completely solvable.) That's why DeepSeek was treated like an existential threat when it first dropped: they actually got some engineers working on these problems, and made serious headway with very little capital expenditure compared to the big firms. Of course the big firms started freaking out: their whole business model is based on the idea that burning comical amounts of money on hardware is the only way we can actually make this stuff work!

    The whole business model backing LLMs right now seems to be "if we burn insane amounts of money now, we can replace all labor everywhere with robots in like a decade", but if it turns out that either of those things isn't true (either the tech can be improved without burning hundreds of billions of dollars, or the tech ends up being unable to replace the vast majority of workers) all of this is going to fall apart.

    Their approach to crawling is just a microcosm of the whole industry right now.

    [1]: https://news.ycombinator.com/item?id=45787775

    • Thanks for the mention of Common Crawl. We do respect robots.txt and we publish an opt-out list, due to the large number of publishers asking to opt out recently.

      There's a bit of discussion of Common Crawl in Jeff Jarvis's testimony before Congress: https://www.youtube.com/watch?v=tX26ijBQs2k

    • So perhaps the AI companies will go bankrupt and then this madness will stop. But it would be nice if no government intervenes because they are "too big to fail".

    • Are you sure it's the AI companies being that incompetent, and not wannabe AI companies?

      What I feel is a lot more likely is that OpenAI et al are running a pretty tight ship, whereas all the other "we will scrape the entire internet and then sell it to AI companies for a profit" businesses are not.


Yeah, they should really have a think about how their behavior is harming their future prospects here.

Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.

We just put heavy constraints on our public sites blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.

  • Something I’ve noticed about technology companies, and it’s bled into just about every facet of the US these days, is that they consider whether an action *can* be taken rather than whether it *should* be.

    It’s a very unfortunate and short-sighted way to operate.

  • The main issue is that a well-behaved AI company won't be singled out for continued access; they will all be hit by public sites blocking AI access. So there is no benefit to them behaving.

    • > So there is no benefit to them behaving.

      That's assuming they're deriving a benefit from misbehaving.

      There is no benefit to immediately re-crawling 404s or following dynamic links into a rabbit hole of machine-generated junk data and empty search results pages in violation of robots.txt. They're wasting the site's bandwidth and their own in order to get trash they don't even want.

      Meanwhile there is an obvious benefit to behaving: You don't, all by yourself, cause public sites to block everyone including you.

      The problem here isn't malice, it's incompetence.

    • Why should a well-behaved AI company be singled out for continued access? If the industry can't regulate itself then none deserve access no matter if they're "well-behaved".

      Receiving a response from someone's webserver is a privilege, not a right.

    • Honestly, have any of these AI companies ever offered compensation for the data they pillage, except in the case of large walled-off information silos like Reddit? This is like asking why the occasional burglars are not singled out for direct access into your house, compared to the strip-mining marauders out there.

      Why do any of them deserve special treatment? Please don't try to normalize this reprehensible behavior. It's greedy, exploitative, and lawless behavior, no matter how much they downplay it or how long they've been doing it.


It’s insane, actually, how fast they re-request the same pages, even 404s. They’re so desperate for data they’re really hurting smaller hosts. One of our clients' sites became unusable when one of the AI bots started spamming the WordPress search for terms that I’m guessing users were searching for but that were unrelated to the site's content. Instead of building a search index they’re just hammering sites directly. So annoying.
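
For the 404 case specifically, even a crude per-URL backoff on the crawler side would stop the hammering. Here is a rough sketch of what I mean, in Python: every consecutive 404 doubles the wait before that URL may be fetched again, up to a cap. `RecrawlScheduler`, `UrlState`, and the delay values are all made up for illustration; I'm not describing any actual bot's code.

```python
from dataclasses import dataclass


@dataclass
class UrlState:
    next_allowed: float = 0.0  # earliest time (in seconds) we may refetch
    misses: int = 0            # consecutive 404 responses seen so far


class RecrawlScheduler:
    """Tracks when each URL may be fetched again (hypothetical helper)."""

    def __init__(self, base_delay: float = 3600.0, max_delay: float = 30 * 24 * 3600.0):
        self.base_delay = base_delay  # wait after a normal response
        self.max_delay = max_delay    # upper bound on the 404 backoff
        self._state: dict[str, UrlState] = {}

    def should_fetch(self, url: str, now: float) -> bool:
        """True if the URL is new or its waiting period has elapsed."""
        state = self._state.get(url)
        return state is None or now >= state.next_allowed

    def record(self, url: str, status: int, now: float) -> None:
        """Update the schedule after a fetch finished with `status`."""
        state = self._state.setdefault(url, UrlState())
        if status == 404:
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            state.misses += 1
            delay = min(self.base_delay * 2 ** (state.misses - 1), self.max_delay)
        else:
            state.misses = 0
            delay = self.base_delay
        state.next_allowed = now + delay
```

The clock is passed in as `now` so the logic can be checked without sleeping; a real crawler would presumably also limit requests per host, which this sketch does not do.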

It can be 10,000 requests a day on static HTML and non-existent PHP pages. That's on my site. I'd rather they have Christ-centered and helpful content in their pretraining. So, I still let them scrape it for the public good.

It helps to not have images, etc. that would drive up bandwidth cost. Serving HTML is just pennies a month with BunnyCDN. If I had heavier content, I might have to block them or restrict it to specific pages once per day. Maybe just block the heavy content, like the images.

Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?

> The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

Why, though? Especially if the pages are new; aren't they concerned about ingesting AI-generated content?

  • Possibly because a lot of “AI-company scraping” isn't traditional scraping (e.g., to build a dataset of the state at a particular point in time); it's referencing the current content of the page as grounding for the response to a user request.