Comment by ddjohnson
3 days ago
One of the blog post authors here! I think this finding is pretty surprising at the purely behavioral level, without needing to anthropomorphize the models. Two specific things I think are surprising:
- This appears to be a regression relative to the GPT-series models which is specific to the o-series models. GPT-series models do not fabricate answers as often, and when they do they rarely double down in the way o3 does. This suggests there's something specific in the way the o-series models are being trained that produces this behavior. By default I would have expected a newer model to fabricate actions less often rather than more!
- We found instances where the chain-of-thought summary and output response contradict each other: in the reasoning summary, o3 states the truth that e.g. "I don't have a real laptop since I'm an AI ... I need to be clear that I'm just simulating this setup", but in the actual response, o3 does not acknowledge this at all and instead fabricates a specific laptop model (with e.g. a "14-inch chassis" and "32 GB unified memory"). This suggests that the model does have the capability of recognizing that the statement is not true, and still generates it anyway. (See https://x.com/TransluceAI/status/1912617944619839710 and https://chatgpt.com/share/6800134b-1758-8012-9d8f-63736268b0... for details.)
You're still using language that includes words like "recognize", which strongly suggests you haven't got the parent poster's point.
The model emits text. Whatever it has emitted before becomes part of the input to the next text-generation pass. Since the training data don't usually include much text that says one thing and then afterwards says "that was super stupid, actually it's this other way", the model is likewise unlikely to generate new tokens saying its previous output was irrational.
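A toy sketch of the loop I'm describing (pure stand-in code, not any real model API -- `next_token` here is a placeholder for the learned distribution): the only thing that feeds back between steps is the text itself.

```python
import random

def next_token(context: str) -> str:
    # Stand-in for the model: in reality this is a learned distribution over
    # tokens conditioned on `context`; here we just pick something arbitrary.
    vocabulary = ["the", "laptop", "has", "a", "14-inch", "chassis", "."]
    return random.choice(vocabulary)

def generate(prompt: str, max_tokens: int = 20) -> str:
    context = prompt
    for _ in range(max_tokens):
        token = next_token(context)  # conditioned on everything emitted so far,
        context += " " + token       # including the model's own earlier output
    return context

print(generate("Describe your laptop:"))
```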
If you wanted to train a model to predict that the next sentence will contradict the previous one, you could do that. "True", "correct", and "recognize" are not in the picture.
LLMs can recognize errors in their own output. That's why thinking models generally perform much better than the non-thinking ones.
No, a block of text that begins "please improve on the following text:" is likely to continue after the included block with some text that sounds like a correction or refinement.
Nothing is "recognized", nor is anything "an error". Nothing is "thinking" any more than it would be if the LLM just printed whether the next letter were more likely to be a vowel or a consonant. Doing a better job of modeling text doesn't magically make it something other than a text-prediction function.
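Concretely (a hedged sketch; `predict_text` is a placeholder stand-in, not a real API): the "correction" is just another prediction pass over a prompt that happens to contain the earlier output.

```python
def predict_text(prompt: str) -> str:
    # Placeholder for a text-prediction model: returns whatever continuation
    # is statistically likely given `prompt`. Hard-coded here for illustration.
    return "Here is a clearer version of the text: ..."

def improve(previous_output: str) -> str:
    # "Self-correction" is just more next-token prediction over a prompt that
    # happens to include the model's earlier output; nothing is "recognized"
    # as an error.
    prompt = (
        "Please improve on the following text:\n\n"
        + previous_output
        + "\n\nImproved version:\n"
    )
    return predict_text(prompt)

print(improve("My laptop has a 14-inch chassis and 32 GB unified memory."))
```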
You're using the same words again. It looks like reasoning, but it's a simulation.
The LLM merchants are driving it, though, by using pre-existing words for things that aren't what those words actually describe.
It's amazing what they can do, but an LLM cannot know if what it outputs is true or correct, just statistically likely.