Comment by Hizonner
3 days ago
That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.
But if there's a (discoverable) page comparing every revision of a page to every other revision, then a page with N revisions yields (N^2 - N)/2 delta pages, so could it just be that the majority of the distinct pages your Wiki has are deltas?
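Just to put rough numbers on that quadratic growth (a minimal sketch; the revision counts are made up purely for illustration):

```python
# Number of distinct old/new revision pairs for a page with N revisions,
# i.e. (N^2 - N) / 2. The revision counts below are illustrative only.
def diff_pages(n_revisions: int) -> int:
    return n_revisions * (n_revisions - 1) // 2

for n in (10, 100, 1000):
    print(f"{n} revisions -> {diff_pages(n)} possible diff pages")
# 10 revisions -> 45, 100 -> 4950, 1000 -> 499500
```

So a handful of long-lived pages could easily dominate the distinct-URL count if a crawler follows every pairwise diff link.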
I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?
The questions just multiply.
It's near-stock MediaWiki, so it has a ton of old versions and diffs off the history tab, but I'd expect a crawler to be able to handle that.
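The diff and history views all go through index.php with diff=, oldid=, or action=history query parameters, so a crawler that wanted to skip them could do so trivially. A minimal sketch of that kind of filter (URL patterns from a default install; the helper name is just for illustration):

```python
import re

# Typical diff/history URLs on a near-stock MediaWiki install, e.g.:
#   /index.php?title=Some_Page&action=history
#   /index.php?title=Some_Page&diff=12346&oldid=12345
DIFF_OR_HISTORY = re.compile(r"[?&](?:diff=\d+|oldid=\d+|action=history)\b")

def is_low_value_for_scraping(url: str) -> bool:
    """Heuristic: skip revision diffs and history pages when crawling."""
    return bool(DIFF_OR_HISTORY.search(url))

print(is_low_value_for_scraping("/index.php?title=Main_Page&diff=105&oldid=104"))  # True
print(is_low_value_for_scraping("/wiki/Main_Page"))  # False
```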