Comment by an0malous

11 hours ago

> why are we concluding that bigger models and more data = more hallucination?

That’s not what your quotes said. They said bigger models = plateau in intelligence, and nothing about more data or increased hallucinations.

The relevant quote for what you’re talking about would be:

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.

So there are two separate claims: 1) bigger models have plateauing results, and 2) models trained on larger amounts of factual data have a higher hallucination rate.

I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the post itself.

I find these internet arguments talking about LLMs as if they are trained by reading the internet to be wild.

Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.

  • I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.

    There are a bounded number of (useful) derivations/combinations of Duff's device.

    If frontier labs wish to reduce hallucinations on factual things, they (or their data providers) will have to hire people to do fundamental research above and beyond what is available in the extant literature and on the web. I.e., if the LLMs are to lower their precision error, the labs need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?

    Yes, they can digitize more books, but that is untrustworthy data: if there were enough eyeballs on a particular work, it would already be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?

    I dunno, it doesn't seem to me that "more data" is the magic bullet here. Yeah, it will "help", but we're already on the flat part of the S-shaped curve.

    My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!

    • As a side gig, I write novel software that solves problems no existing software does, that existing LLMs have difficulty reproducing, purely for the purpose of existing as LLM training data.

      There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.

      It's insane.

      Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.

      The defining characteristic of all of this new data is that it is targeted at LLMs' weak points.

      It's not just more data, it's custom tutorials built for what LLMs struggle at.

    • >> They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time.

      How odd. It's Expert Systems and the Knowledge Acquisition Bottleneck all over again.

  • Where do they get the bespoke training data from? And how much? I don’t really know anything about this.

    • > And how much?

      Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.

      That alone is about $1.1B a year if the rate holds year-round, so the total spent on bespoke training data is well into the billions of dollars a year.

      That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.

      They are just one of many.

> That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations ... I’m pretty sure #1 is well known

Well known in a multiverse branch where Fable was a dud?

  • No, well known in the current multiverse branch where we still occasionally use things like math and scientific analysis instead of people’s vibe checks and pelican SVGs.

    Here’s the paper from OpenAI where Dario himself was a co-author: https://arxiv.org/pdf/2001.08361

    > We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N, D, C_min are power-laws, there are diminishing returns with increasing scale.
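To make the "power laws imply diminishing returns" point in the quote concrete, here is a minimal sketch. It assumes a loss of the form L(N) = (N_c / N)^alpha_N with data and compute not limiting; the function names are hypothetical, and the constants (N_c = 8.8e13, alpha_N = 0.076) are approximate values used for illustration only, not a fit you should rely on.

```python
def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Power-law loss L(N) = (N_c / N) ** alpha_N.

    Assumes data and compute are not the bottleneck. The default constants
    are illustrative assumptions, not authoritative values.
    """
    return (n_c / n_params) ** alpha_n


def gain_from_10x(n_params: float) -> float:
    """Absolute loss reduction from multiplying the parameter count by 10."""
    return loss_from_params(n_params) - loss_from_params(10 * n_params)


if __name__ == "__main__":
    # Each row is a model 10x larger than the previous one.
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N={n:.0e}  loss={loss_from_params(n):.3f}  gain from 10x={gain_from_10x(n):.3f}")
```

Under these assumptions every 10x in parameters multiplies the loss by the same factor (10^-0.076, a cut of roughly 16%), so the absolute improvement bought by each additional 10x keeps shrinking.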

    • > instead of people’s vibe checks and pelican SVGs.

      Right, what happened is everyone went to Fable and asked it to make the very best bicycle pelican SVG, no mistakes. And Fable's bicycle pelican SVGs were such timeless masterpieces, we all instantly got AI psychosis. Happily, you were immune to this.

#2 is not that surprising from first principles if the way you made the bigger model was by feeding it poorer-quality training data because that was the only way to get enough of it.

Yeah #2 may be incidental. Suppose one lab focused on bigger, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.

I can’t prove it but I suspect there’s a bit of that going on.

  • I think one problem is that models that hallucinate often still get it right a few times out of 8 or 16 attempts, so they get good results on benchmarks, most of which measure success out of the top k. From a benchmark's perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the 1 out of 16 you get is actually the successful one. I think this effect is easier to see on Gemini Flash: it hallucinates like crazy but looks like it's by design to boost benchmarks.
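The gap between "one of k attempts succeeded" and "the single answer a user sees is right" can be sketched numerically. This assumes each generation succeeds independently with the same probability, which real model samples will not satisfy exactly; `pass_at_k` is a hypothetical helper name, not any benchmark's official metric code.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds,
    given a per-attempt success probability p_single (an assumption:
    attempts are treated as independent and identically distributed)."""
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    p = 1 / 16  # the model is right on 1 generation in 16 on average
    print(f"what a user sees (k=1):    {pass_at_k(p, 1):.3f}")
    print(f"what a top-16 score sees:  {pass_at_k(p, 16):.3f}")
```

With a per-attempt success rate of 1 in 16, the top-16 figure comes out around 0.64 while the single-shot figure stays at 0.0625, which is the mismatch the comment describes.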

    • > it hallucinates like crazy but looks like it's by design to boost benchmarks.

      Wasn’t there a discussion recently around some new-ish benchmark _punishing_ hallucinated answers (relative to not replying at all)? Maybe in the not-so-distant future, this “spam replies until one’s correct” strategy won’t be able to game a benchmark much at all anymore.
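A rough sketch of why penalizing wrong answers changes the incentive. The scoring rule here is hypothetical (+1 for a correct answer, 0 for abstaining, minus `wrong_penalty` for a wrong answer) and is not taken from the benchmark mentioned above, which the comment does not name.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering under an assumed rule:
    +1 if correct, -wrong_penalty if wrong. Abstaining scores 0."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only if the expected score beats abstaining (which scores 0)."""
    return expected_score(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    for penalty in (0.0, 1.0, 3.0):
        # Break-even confidence: p > penalty / (1 + penalty).
        threshold = penalty / (1.0 + penalty)
        print(f"penalty={penalty}: answering pays off only above p={threshold:.2f}")
```

With no penalty, guessing is never worse than abstaining, so a model is rewarded for always producing an answer. Once wrong answers cost something, a low-confidence guess has negative expected score and staying silent becomes the better strategy.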