Comment by NitpickLawyer
1 day ago
Perhaps I worded it poorly. My main point was that articles focus on the wrong thing. Most coverage of that paper was "Using LLM-generated data leads to CATASTROPHIC collapse", without reading the fine print.
> [...] cyclically training models on their own data. It has nothing to do with model size.
Of course it does. GRPO is basically "training models on their own data": you sample, you check against a known ground truth, you adapt the weights. Repeat. And before GRPO there was RLAIF, which showed scores improving across 3 "stages" of generate - select - re-train, with diminishing returns after 3 stages but no catastrophic collapse.
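To make that loop concrete, here's a toy sketch of a GRPO-style step in Python. Everything here (the reward, the group size, the update rule) is a made-up caricature for illustration, not the paper's or any real trainer's code:

```python
# Toy GRPO-style loop: sample, verify against a known ground truth,
# adapt, repeat. All names and numbers are hypothetical.
import random

GROUND_TRUTH = 4  # a verifiable answer, e.g. to "2 + 2"

def sample_answer(p_correct):
    # Stand-in for sampling a completion from the model.
    return GROUND_TRUTH if random.random() < p_correct else random.randint(0, 9)

def grpo_step(p_correct, group_size=8, lr=0.1):
    # Sample a group of answers and score each against the ground truth.
    rewards = [1.0 if sample_answer(p_correct) == GROUND_TRUTH else 0.0
               for _ in range(group_size)]
    baseline = sum(rewards) / group_size          # group mean as baseline
    advantages = [r - baseline for r in rewards]  # group-relative advantages
    # Caricature of the weight update: nudge the policy toward the
    # positive-advantage samples. If every sample gets the same reward,
    # all advantages are zero and nothing moves.
    positive = sum(a for a in advantages if a > 0)
    return min(1.0, p_correct + lr * positive / group_size)

p = 0.3  # model starts mostly wrong
for _ in range(100):
    p = grpo_step(p)
print(f"probability of the correct answer after training: {p:.2f}")
```

The detail that matters is the check against GROUND_TRUTH: the samples come from the model, but the reward signal doesn't.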
My main point was about articles cherry-picking catchy phrases, not criticising the research. We need the research. But we also need good articles that aren't written just because negativity sells.
cheeky edit: see this thread [1]. I know slashdot has fallen a lot in recent years, but I skimmed the root comments. Not one addresses the "toy model" problem. Everyone reads the title and reinforces their own biases. That's the main problem I was trying to address.
1 - https://slashdot.org/story/25/08/11/2253229/llms-simulated-r...
If you have a ground truth that you're comparing to, that's not training on your own data.