Comment by wolttam

10 hours ago

> it is clear that actual intelligence has plateaued significantly.

> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse

These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.

Edit: My mention of data comes from this quote:

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling

My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more, high-quality data in the distribution which you want to unlock capabilities for.

> why are we concluding that bigger models and more data = more hallucination?

That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations

The relevant quote for what you’re talking about would be:

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.

So there’s two separate claims: 1) bigger models have plateauing results 2) models trained on larger amounts of factual data have a higher hallucination rate

I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the actual post contents.

  • I find these internet arguments talking about LLMs as if they are trained by reading the internet to be wild.

    Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.

    • I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.

      There are a bounded number of (useful) derivations/combinations of Duff's device.

      If frontier labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?

      Yes, they can digitize more books, but that is untrustworthy data: if there were enough eyeballs on a particular work, it would be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?

      I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S shaped curve.

      My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!

  • > That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations ... I’m pretty sure #1 is well known

    Well known in a multiverse branch where Fable was a dud?

    • No, well known in the current multiverse branch where we still occasionally use things like math and scientific analysis instead of people’s vibe checks and pelican SVGs.

      Here’s the paper from OpenAI where Dario himself was a co-author: https://arxiv.org/pdf/2001.08361

      > We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation Cmin, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,Cmin are power-laws, there are diminishing returns with increasing scale.
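
To make the "power laws mean diminishing returns" point concrete, here is a rough sketch. It assumes the single-variable form L(N) = (N_c / N)^alpha from that paper, with an exponent of about 0.076 for parameter count. The function names and the sample model sizes are made up for illustration, and the constants should be read as placeholders rather than exact figures.

```python
# Illustrative sketch of a power-law loss curve, L(N) = (N_c / N) ** alpha.
# ALPHA_N and N_C are assumed values (roughly those reported for parameter
# count in the linked paper); treat them as placeholders, not ground truth.
ALPHA_N = 0.076
N_C = 8.8e13


def power_law_loss(n_params: float, alpha: float = ALPHA_N, n_c: float = N_C) -> float:
    """Loss predicted by a pure power law in non-embedding parameter count."""
    return (n_c / n_params) ** alpha


def loss_ratio_per_doubling(alpha: float = ALPHA_N) -> float:
    """Factor by which the loss shrinks each time the parameter count doubles."""
    return 2.0 ** (-alpha)


if __name__ == "__main__":
    sizes = [1e9, 1e10, 1e11, 1e12]  # hypothetical model sizes
    for small, big in zip(sizes, sizes[1:]):
        gain = power_law_loss(small) - power_law_loss(big)
        print(f"{small:.0e} -> {big:.0e}: loss drops by {gain:.3f}")
    print(f"each doubling multiplies loss by {loss_ratio_per_doubling():.4f}")
```

Under these assumptions, each 10x in parameters buys a smaller absolute drop in loss than the previous 10x. That is diminishing returns in the sense the quote describes, which is not the same thing as a hard plateau.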

  • #2 is not that surprising from first principles if the way you made the bigger model was by feeding it poorer quality training data because it’s the only way you can get enough

  • Yeah #2 may be incidental. Suppose one lab focused on bigger, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.

    I can’t prove it but I suspect there’s a bit of that going on.

    • I think one problem is that models which hallucinate often still get it right a few times out of 8 or 16, so they get good results on benchmarks, most of which measure success out of the top k. From a benchmark perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the 1 out of 16 you get is actually the successful one. I think this effect is easiest to see on Gemini Flash: it hallucinates like crazy, but it looks like that's by design to boost benchmarks.
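
A quick sketch of the gap between "one of k samples succeeded" and what a single user sees. It assumes every generation succeeds independently with the same probability `p`, which is a simplification, and the 1-in-16 rate is a hypothetical number chosen to match the example above, not a measurement of any model.

```python
# Sketch: benchmark-style "any of k samples succeeds" vs. a single sample.
# Assumes each generation succeeds independently with probability p.


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    p = 1 / 16  # hypothetical per-sample success rate
    for k in (1, 8, 16):
        print(f"k={k:2d}: pass@k = {pass_at_k(p, k):.3f}")
```

With these made-up numbers, a model that is right about 6% of the time on a single try scores about 64% when it is graded on the best of 16 tries.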

Yeah not only is it totally unsubstantiated, the benchmarks are getting less useful to really show the difference between these models. Big model smell is still a thing and GLM 5.2 while impressive is not Fable class.

Here is something I would like people to chew on. Perhaps the smartest researchers in the world across multiple labs know more about this than we do? Perhaps they are aware of issues like the data wall and diminishing marginal returns. And perhaps they are being honest when they tell you there is no wall?

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling

I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.

It was obvious right from the get-go that the scaling law just enabled some abilities that were already described by the underlying data, by allowing the ANN to abstract them in the latent space.

Aren't hallucinations also heavily influenced by compute and memory capacity? I.e., companies can spend more time verifying results in an agentic format, spend more thinking tokens, and use less quantization. All of these depend heavily on compute and memory but are proven to decrease hallucinations.
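
As one toy illustration of trading compute for accuracy, consider sampling an answer several times and taking a majority vote. This is only a sketch of one simple technique, not a description of what any lab actually does. It assumes answers are simply right or wrong and that samples are independent with the same per-sample accuracy `p`, and the function name is hypothetical.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct."""
    need = n // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for n in (1, 5, 15):  # odd n avoids ties
        print(f"n={n:2d}: majority vote correct with prob {majority_correct(0.7, n):.3f}")
```

Under these assumptions extra samples only help when a single sample is right more often than not. If `p` is below 0.5, voting makes things worse.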

Maybe GPT 5.5 is heavily nerfed due to lack of compute, memory, and energy?

I agree that it's far-fetched to conclude that bigger models have plateaued.

Isn't that a case of overfitting? You have more data, but when you ask something that's not in that data, hallucinations happen.

>> it is clear that actual intelligence has plateaued significantly.

> These are wild claims -

Indeed, it is not clear there was any actual intelligence at any point.

A lot of generated content sure, sometimes even useful, but not necessarily anything more.

  • What is the definition of "actual intelligence"? How does it differ from regular intelligence and non-intelligence?

    If someone can "design a custom asyncio event loop policy in that overrides get_child_watcher()", I would call that person intelligent. Does that mean that person is not actually intelligent but a mere content creation machine?

    Traditionally, if you can create content, this shows you're intelligent. Created content is often called "intellectual" property. If a person can understand complex ideas and make connections between them, that is considered intellectual work. You have to be intelligent to do intellectual work. If a person can solve problems, this is also called intelligence. If the person can solve more complex problems, that person is said to have higher intelligence. This is often measured with a scale called IQ (Intelligence Quotient). There are other types of intelligence but they are basically variations of the same ability. Most definitions of intelligence also involve an ability to adapt to the environment.

    Since intelligence is such a broad concept, what exactly is the difference between actual intelligence and AI, other than that one is natural and the other is artificial?

    I understand being anti-AI because of the very real societal concerns. But ignoring what is in front of you is not a solution.

>These are wild claims - why are we concluding that bigger models and more data = more hallucination?

Because that's what they measured in this case.

To train models to be smarter than they are, one needs examples and cases to train on, and once you get close to the top percentiles of human reasoning there is extremely little such material available.

You can create contrived logic problems, but they often turn into language games because English is not formal logic.

And you can train on "monty hall" style problems, but those too are language games that are intriguing to humans but obvious when framed slightly differently.

In other words, model trainers are fighting against the overwhelming mediocrity of the training corpus (all of the recorded human output from history).

As models improve, the next phase will be models co-designed with humans to overcome these limits. The way we use language and the process we use to solve problems (we currently call this "orchestration") will evolve as part of this. Meatspace metaphors map badly when we have massive context and don't need the same limits. How different is hallucination from extrapolation, etc.?

Much of the skepticism and confusion about LLMs is no different than a person of average intelligence hearing a highly intelligent person explain something and considering the explanation gibberish, then arrogantly accusing the intelligent person of being unhelpful.

Much like dogs were domesticated from wolves to have traits that make them good around humans, LLMs will evolve around our limits, around our arrogance, around our aesthetic biases and prejudices. Intelligence and rationality are fundamentally not what most humans want from an LLM.

My impression is that the fundamental issue is that LLMs attempt to extract reasoning (executive function) from data (relationships between tokens).

There's an open question about whether this is theoretically possible, but it doesn't seem like it to me.

Human generated data is an effect of reasoning. Attempting to extract executive function from it is kind of like taking an anti-derivative of a function.

This has always seemed like the root of hallucinations to me. It sort of follows the parallels to lossy compression that a lot of people draw. You're extracting some characteristics by observing the relationship between tokens, and then trying to argue that those characteristics are equivalent to the thing that generated the original tokens.

Surely there's some sort of overlap there, but viewed that way, it seems obvious that more and more parameters and scaling won't solve the fundamental problem. There's only so much meaning you can extract from token relationships.

It's like trying to derive the shape of a flame from the smoke it produces.

The original intelligence that created those tokens was driven by a whole universe of inputs, from hormones to starlight to gravity, not to mention all of the strange things about consciousness and parapsychology that are so poorly understood.

The machines are definitely useful for a certain class of tasks - those that don't require much executive function, and the useful work mostly involves pattern matching.

The problem is, we seem to be mistaking effect for cause and imagining that these things have greater capabilities than they'll ever possess.

The investors that don't understand this are indeed going to learn a bitter lesson.

You mixed two random quotes from the article to create a strawman.

Of course you knew what you were doing, but it's disappointing that this was the top comment.

In cognitive science, it appears your brain has two modes of thinking:

- A very parallel type of computation that is fast and generally accurate and integrates hundreds of variables. It’s sometimes labeled as intuition or system 1 thinking.

- A much slower, step by step, analytical type, commonly linked with your pre-frontal cortex (one of the newest parts of the brain). Sometimes called system 2 thinking.

Maybe the way the universe works is that all computation is more or less one of those two types. In which case, an LLM alone is only the first part, which is often right but whose results can never be proven.

  • An LLM is not thinking; assuming it is and relating it to thought and universal truths is nonsense.

    • We inflicted that on ourselves by picking the most confusing terminology ever. "No, reasoning isn't thinking. No, when the model says it thinks it's not actually thinking... No, an agent isn't actually a creature with agency... No, when we say it hallucinates it doesn't, like, actually hallucinate"
