
Comment by NitpickLawyer

9 hours ago

> Without specification, we employ a decoder-only language model GPT2 (Radford et al., 2019) with a configuration of 4 layers, 32 hidden dimensions, and 4 attention heads.

Yeah, ok. The research is interesting and warranted, but writing an article about it, leading with conclusions gathered from toy models, and implying they generalise to production LLMs is useless.

We've been here before with small models. Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print: they were testing on small toy models and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results.

Research in this area is good and needed, mainly to understand limitations, discover whether there are scale levels where "emergent" stuff appears, and so on. But writing articles based on incipient research on tiny models is not worth the effort.
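
For a rough sense of scale, here is a minimal sketch (assuming the Hugging Face transformers GPT2Config/GPT2LMHeadModel API) comparing the quoted toy configuration with GPT-2 small; it prints the parameter counts rather than asserting them:

    from transformers import GPT2Config, GPT2LMHeadModel

    # The toy configuration quoted above: 4 layers, 32 hidden dims, 4 attention heads.
    toy = GPT2LMHeadModel(GPT2Config(n_layer=4, n_embd=32, n_head=4))

    # GPT-2 small for comparison: the library defaults (12 layers, 768 hidden dims, 12 heads).
    small = GPT2LMHeadModel(GPT2Config())

    count = lambda m: sum(p.numel() for p in m.parameters()) / 1e6
    print(f"toy model from the paper: {count(toy):.2f}M parameters")
    print(f"GPT-2 small:              {count(small):.2f}M parameters")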

Doing analysis on small models or small data is perfectly valid if the results extrapolate to large models. Which is why right now we're looking at new research papers that are still listing the same small datasets and comparing to the same small models that papers five years ago did.

  • Aren't most major LLMs moving to an architecture where the model is made up of tons of smaller models?

    There's a mountain of reasons why this makes sense from a cost perspective, and seemingly for quality too, as the newer models train substantially more cheaply and still outperform the older ones (a rough sketch of the routing idea is below).

    Naively, this seems like it would be relevant.
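
    To make "tons of smaller models" concrete, here is a minimal mixture-of-experts sketch (plain numpy, not any specific production architecture): a router scores the experts per token and only the top-k are run, so only a fraction of the total parameters is active for each token.

        import numpy as np

        rng = np.random.default_rng(0)
        d_model, n_experts, top_k = 16, 8, 2

        router_w = rng.normal(size=(d_model, n_experts))             # routing weights
        experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

        def moe_layer(x):                                            # x: one token, shape (d_model,)
            scores = x @ router_w
            chosen = np.argsort(scores)[-top_k:]                     # top-k experts for this token
            gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
            return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

        token = rng.normal(size=d_model)
        print("output shape:", moe_layer(token).shape)               # only 2 of 8 experts were used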

  • I have nothing against researching this, I think it's important. My main issue is with articles choosing to grab a "conclusion" and imply it extrapolates to larger models, without any support for that. They are going for the catchy title first, fine-print be damned.

    • I was just at the KDD conference and the general consensus agreed with this paper. There was only one keynoter who just made the assumption that LLMs are associated with reasoning, which was jarring as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead.

      The thing is, I think the current companies making LLMs are _not_ trying to be correct or right. They are just trying to hide it better. In the business future for AI the coding stuff that we focus on on HN - how AI can help/impact us - is just a sideline.

      The huge-money business future of LLMs is end consumers, not creators; it is product and opinion placement, and their path to that is friendship. They want their assistant to be your friend, then your best friend, then your only friend, then your lover. If the last 15 years of social media have been about discord and polarisation to get engagement, the next 15 will be about friendship and love, even though that leads to isolation.

      None of this needs the model to grow strong reasoning skills. That's not where the real money is. And CoT - whilst super great - is just as effective for that purpose if it merely gets better at hiding that it's giving you the wrong answer (by being more internally consistent) as it would be if it were giving you a better answer.

      10 replies →

    • Because model size is a trivial parameter, and not a new paradigm.

      What you're saying is like claiming you can't extrapolate that long division works on 100-digit numbers because you only worked through it with 7-digit numbers and a few small polynomials.

      5 replies →

  • The extrapolation doesn't work if the transformer is too shallow (too few layers) relative to sequence length, because of https://arxiv.org/abs/2503.03961 . A bunch of tasks become infeasible when the layer count is too low, and 4 layers is way too low. That is, linearly increasing the number of layers in a model can result in a superlinear increase in performance on tasks like reasoning.

> conclusions gathered from toy models and implying this generalises to production LLMs is useless

You are just trotting out the tired argument that model size magically fixes the issues, rather than just improves the mirage, and so nothing can be known about models with M parameters by studying models with N < M parameters.

Given enough parameters, a miraculous threshold is reached whereby LLMs switch from interpolating to extrapolating.

Sure!

  • That’s what has been seen in practice though. SOTA LLMs have been shown again and again to solve problems unseen in their data set; and despite their shortcomings they have become extremely useful for a wide variety of tasks.

    • Even a tiny model for, say, classifying hand-written digits will correctly classify digits that didn't appear in its training data (otherwise it wouldn't be very useful; see the sketch below). That classification is interpolative; the hand-written digit lands in the space of the training data.

      Every result is explainable as having come from training data. That's the null hypothesis.

      The alternative hypothesis is that it's not explainable as having come from training data. That's a hard-to-believe, hard-to-prove negative.

      You don't get anything out of any computational process that you didn't put in.
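
      Here is that digit example as a minimal sketch (assuming scikit-learn): the classifier labels digit images it never saw, but only because they fall inside the region covered by the training data.

          from sklearn.datasets import load_digits
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import train_test_split

          # Hold out half of the digits: these images never appear in training.
          X, y = load_digits(return_X_y=True)
          X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

          clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
          print(f"accuracy on unseen digit images: {clf.score(X_te, y_te):.2f}")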

      1 reply →

    • Mind linking any examples (or categories) of problems that are definitively not in pre training data but can still be solved by LLMs? Preferably something factual rather than creative, genuinely curious.

      Dumb question but anything like this that’s written about on the internet will ultimately end up as training fodder, no?

      4 replies →

    • > SOTA LLMs have been shown again and again to solve problems unseen in their data set

      We have no idea what the training data is though, so you can't say that.

      > and despite their shortcomings they have become extremely useful for a wide variety of tasks.

      That seems like a separate question.

      1 reply →

I think it is worth writing about simply because it might get the (cost constrained) researcher's work in front of someone with a near-unlimited research budget at one of the big AI companies.

Almost every mention I've seen of gpt-oss was a complaint that the training on synthetic datasets produced a model that's mostly good at benchmarks. Are benchmarks the great results you're referring to or are there a lot of satisfied users out there that just don't post here on HN? Genuinely curious.

I can see how performing well on benchmarks at the expense of everything else counts as great results if that's the point of the model.

> Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print: they were testing on small toy models and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results

You're conflating two very different things. Training on synthetic data one time is very different than cyclically training models on their own data. It has nothing to do with model size.
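
The cyclic case has a simple toy illustration (plain numpy, with a Gaussian standing in for the model; this shows the feedback intuition only, not anything about any particular LLM): refit a distribution to samples drawn from the previous fit, and on average the estimated spread shrinks a little each round while sampling noise accumulates, so over many generations the distribution tends to narrow.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0      # generation 0: the "real data" distribution
    n = 20                    # samples produced per generation

    for gen in range(1, 201):
        samples = rng.normal(mu, sigma, size=n)    # generate from the current model
        mu, sigma = samples.mean(), samples.std()  # refit the model on its own output
        if gen % 50 == 0:
            print(f"generation {gen:3d}: fitted sigma = {sigma:.4f}")

Training once on a fixed synthetic corpus has no such loop, which is the distinction being drawn here.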

  • Perhaps I worded it poorly. My main point was that articles focus on the wrong thing. Most coverage of that paper was "Using LLM generated data leads to CATASTROPHIC collapse", without reading the fine print.

    > [...] cyclically training models on their own data. It has nothing to do with model size.

    Of course it does. GRPO is basically "training models on their own data": you sample, you check against a known truth, you adapt the weights, repeat (roughly the loop sketched at the end of this comment). And before GRPO there was RLAIF, which showed improving scores across 3 "stages" of generate - select - re-train, with diminishing returns after 3 stages but no catastrophic collapse.

    My main point was about articles and cherrypicking catchy phrases, not criticising research. We need the research. But we also need good articles that aren't written just because negativity sells.

    cheeky edit: see this thread [1]. I know slashdot has fallen a lot in recent years, but I skimmed the root comments. Not one addresses the "toy" model problem. Everyone reads the title and reinforces their own biases. That's the main problem I was trying to address.

    1 - https://slashdot.org/story/25/08/11/2253229/llms-simulated-r...
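
    For concreteness, here is that sample / check / adapt loop as a toy numeric sketch (plain numpy; a categorical "policy" over four candidate answers stands in for the LLM, and this only mirrors the shape of GRPO-style training, not the actual objective):

        import numpy as np

        rng = np.random.default_rng(0)
        logits = np.zeros(4)       # toy policy over 4 candidate answers
        truth = 2                  # the verifiable "known truth"
        group_size, lr = 8, 0.5

        for step in range(50):
            probs = np.exp(logits) / np.exp(logits).sum()
            samples = rng.choice(4, size=group_size, p=probs)   # sample a group of answers
            rewards = (samples == truth).astype(float)          # check against the known truth
            adv = rewards - rewards.mean()                      # group-relative advantage
            for s, a in zip(samples, adv):                      # adapt the weights (REINFORCE-style)
                grad = -probs.copy()
                grad[s] += 1.0                                  # d log p(s) / d logits
                logits += lr * a * grad / group_size

        probs = np.exp(logits) / np.exp(logits).sum()
        print("probability of the correct answer after training:", round(float(probs[truth]), 3))

    The model is trained entirely on its own samples, but the external check against a known truth keeps the loop from collapsing.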

  • "Training on synthetic data one time is very different than cyclically training models on their own data.", but every one with even a modicum of understanding of feedback knows that cyclic training on its own output will end in tears; it's bordering on a tautologic inverse.

    • Is there an actual general principle or theorem or anything that you can link on this? I’m skeptical because these “model collapse” ideas sound vaguely technical and intuitive, but mostly seem to be based on observations about things that happened to happen with current LLMs. It gets bandied about like it is the most obvious thing, but the support mostly seems to be… pseudo-technical vibes.