Comment by jhull
19 days ago
> And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki. And I mean that - they indexed every single diff on every page for every change ever made.
Is it stupid? It makes sense to scrape all these pages and learn the edits and corrections that people make.
It seems like they're just grabbing every possible bit of data available; I doubt there's any mechanism to flag which edits are corrections when training.