Comment by Hizonner

3 days ago

> The difference between that and the LLM training data scraping

Is the traffic that people are complaining about really training traffic?

My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
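
A minimal sketch of that arithmetic, where every figure is just the guess above, not a measurement:

    # Back-of-envelope version of the worst case described above; all numbers are SWAGs.
    models_per_year = 24       # "on the order of dozens" of foundation models trained per year
    crawls_per_model = 10      # assume no caching and 10 full crawls of every site per model
    downloads_per_site = models_per_year * crawls_per_model
    print(downloads_per_site)  # 240 full-content downloads of any given site per year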

That doesn't seem like enough traffic to be a really big problem.

On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually; it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
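
To make the contrast concrete, here's a rough sketch of how that inference-side traffic scales; every number below is a hypothetical assumption for illustration, not anything measured:

    # Hypothetical illustration only: none of these figures come from real usage data.
    sites_per_report = 200            # "maybe visits a couple of hundred sites"
    pages_per_site = 3                # "only a few pages on each site"
    reports_per_user_per_month = 20   # people ask for far more reports than they'd do by hand
    users = 1_000_000                 # made-up active-user count for a Deep-Research-style feature

    # Unlike training, this grows with query volume, not with the number of models trained.
    fetches_per_month = sites_per_report * pages_per_site * reports_per_user_per_month * users
    print(f"{fetches_per_month:,} page fetches per month")  # 12,000,000,000 with these guesses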

That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping," if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.

Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.

So what's really going on here? Anybody actually know?

The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.

There's some user-directed traffic, but it's a small fraction, in my experience.

The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits, it's a steady load of traffic going all over the site, from lots of IPs. And for some reason they really like diffs between old page revisions.

  • That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.

    But if there's a (discoverable) page comparing every revision of a page to every other revision, then a page with N revisions has (N^2-N)/2 delta pages, so could it just be that the majority of the distinct pages your wiki has are deltas? (Quick arithmetic at the end of this comment.)

    I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?

    The questions just multiply.
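
    A quick sketch of that revision arithmetic (purely illustrative):

        # Comparing every revision of a page against every other gives C(N, 2) diff pages.
        def diff_pages(n_revisions):
            return (n_revisions * n_revisions - n_revisions) // 2

        for n in (10, 50, 200):
            print(n, "revisions ->", diff_pages(n), "diff pages")
        # 10 -> 45, 50 -> 1225, 200 -> 19900: diffs quickly outnumber the "real" pages.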

    • It's near-stock MediaWiki, so it has a ton of old revisions and diffs off the history tab, but I'd expect a crawler to be able to handle that.