Comment by ur-whale
4 hours ago
Why does no one ever talk about the one thing nobody can get their hands on except the big labs?
I'm talking about the training set.
Sure there are some open sets out there.
But my guess is they are nowhere near what OpenAI, Google and Anthropic are actually using.
Happy to be proven wrong.
I think OpenAI and Anthropic just downloaded the same torrents from Anna's Archive that anyone else can. But it's only OK when they do it. The rest of us get nastygrams from law offices. Anthropic actually had to cough up some bucks, for that matter.
At that point, a lot depends on the quality of the preprocessing applied to the raw text dumps. It is reportedly not that trivial to go from DumpOfSketchyRussianPirateSite.zip to a data set suitable for ingestion during pretraining. A few bad chunks of data can apparently do more harm than one would expect.
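For a sense of what that preprocessing involves, here's a toy sketch of the kind of filtering pass people describe: drop near-empty or symbol-heavy chunks and exact duplicates. This is my own illustrative guess, not anyone's actual pipeline; real ones reportedly add language ID, perplexity filters, fuzzy dedup, and much more.

```python
import hashlib

def clean_corpus(docs, min_words=20, max_symbol_ratio=0.3):
    """Toy filtering pass over raw text documents.

    Drops documents that are too short, dominated by non-alphanumeric
    "garbage" characters, or exact duplicates of something already kept.
    """
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text.split()) < min_words:
            continue  # too short to be useful training text
        # crude garbage heuristic: fraction of non-alphanumeric, non-space chars
        symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue
        # exact-duplicate removal via content hash
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Even heuristics this crude kill a surprising fraction of a raw dump, which hints at why a few bad chunks slipping through can matter.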
AFAIK Google scans almost everything in print as part of the Google Books initiative, so they may have been able to skip the torrenting step.