Comment by syllogism
8 hours ago
It's interesting that there's still such a market for this sort of take.
> In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text."
What does this even mean? Let's veto the word "reasoning" here and reflect.
The LLM produces a series of outputs. Each output changes the likelihood of the next output. So it's transitioning in a very large state space.
Assume there exist some states that the activations could be in that would cause the correct output to be generated. Assume also that there is some possible path of text connecting the original input to such a success state.
The reinforcement learning objective reinforces pathways that were successful during training. If there's some intermediate calculation to do or 'inference' that could be drawn, writing out a new text that makes that explicit might be a useful step. The reinforcement learning objective is supposed to encourage the model to learn such patterns.
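To make that concrete, here's a toy sketch of the picture I have in mind (entirely made up for illustration: a handful of abstract states and a crude REINFORCE-style update, nothing like how an actual LLM or RL pipeline is implemented):

```python
# Toy sketch: generation as a stochastic walk over abstract "states", with an
# update that reinforces whichever transitions appeared on walks that reached
# the correct answer. States and numbers are invented purely for illustration.
import random

TERMINAL = {"correct_answer", "wrong_answer"}

# Initial "policy": a direct jump from the question to the right answer is
# unlikely, but writing an intermediate step makes the right answer reachable.
policy = {
    "question":          {"correct_answer": 0.1, "wrong_answer": 0.3,
                          "intermediate_step": 0.4, "digression": 0.2},
    "intermediate_step": {"correct_answer": 0.6, "wrong_answer": 0.2,
                          "intermediate_step": 0.1, "digression": 0.1},
    "digression":        {"correct_answer": 0.1, "wrong_answer": 0.5,
                          "intermediate_step": 0.2, "digression": 0.2},
}

def sample_path(policy, max_len=6):
    """Walk from the question until a terminal state (or a length cap) is hit."""
    state, path = "question", ["question"]
    while state not in TERMINAL and len(path) < max_len:
        state = random.choices(list(policy[state]),
                               weights=list(policy[state].values()))[0]
        path.append(state)
    return path

def reinforce(policy, episodes=5000, lr=0.1):
    """Bump every transition that lay on a successful path, then renormalise."""
    for _ in range(episodes):
        path = sample_path(policy)
        if path[-1] != "correct_answer":
            continue
        for s, nxt in zip(path, path[1:]):
            policy[s][nxt] += lr
            total = sum(policy[s].values())
            for k in policy[s]:
                policy[s][k] /= total
    return policy

if __name__ == "__main__":
    random.seed(0)
    rate = lambda n=2000: sum(sample_path(policy)[-1] == "correct_answer"
                              for _ in range(n)) / n
    print("success rate before:", round(rate(), 2))
    reinforce(policy)
    print("success rate after: ", round(rate(), 2))
    print("P(intermediate_step | question):",
          round(policy["question"]["intermediate_step"], 2))
```

Run it and the probability of emitting the "intermediate step" goes up, simply because paths through it were rewarded more often. That's all I mean by the model learning to generate intermediate text.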
So what does "sophisticated simulators of reasoning-like text" even mean here? The mechanism that the model uses to transition towards the answer is to generate intermediate text. What's the complaint here?
It makes the same sort of sense to talk about the model "reasoning" as it does to talk about AlphaZero "valuing material" or "fighting for the center". These are shorthands for describing patterns of behaviour, but of course the model doesn't "value" anything in a strictly human way. The chess engine usually doesn't see a full line to victory, but in the games it's played, paths which transition through states with material advantage are often good -- although it depends on other factors.
So of course the chain-of-thought transition process is brittle, and it's brittle in ways that don't match human mistakes. What does it prove that there are counter-examples with irrelevant text interposed that cause the model to produce the wrong output? It shows nothing --- it's a probabilistic process. Of course some different inputs lead to different paths being taken, which may be less successful.
> The mechanism that the model uses to transition towards the answer is to generate intermediate text.
Yes, which makes sense: if there's a landscape of states that the model is traversing, and there are probabilistically likely pathways between an initial state and the desired output, but there isn't a direct pathway, then training the model to generate intermediate text in order to move across that landscape so it can reach the desired output state is a good idea.
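A rough sketch of what that looks like in practice, using the OpenAI Python SDK (the model name and prompts are placeholders I picked for illustration, not a benchmark): the only difference between the two calls is whether the model is allowed to write intermediate text before committing to an answer.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROBLEM = (
    "A train leaves at 09:40 and travels 210 km at 84 km/h, "
    "then waits 25 minutes, then travels another 126 km at 72 km/h. "
    "What time does it arrive?"
)

def ask(instruction: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model would do
        messages=[{"role": "user", "content": f"{PROBLEM}\n\n{instruction}"}],
    )
    return resp.choices[0].message.content

# Direct jump: the model must go straight from the question state to an answer state.
print(ask("Reply with the arrival time only, no working."))

# With intermediate text: the model can emit steps it then conditions on.
print(ask("Work through each leg of the journey in order, then state the arrival time."))
```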
Presumably LLM companies are aware that there is (in general) no relationship between the generated intermediate text and the output. The point of the article is that by calling it a "chain of thought" rather than "essentially-meaningless intermediate text which increases the number of potential states the model can reach", users are misled into thinking that the model is reasoning, and may then make unwarranted assumptions, such as that the model could apply the same reasoning to similar problems, which is in general not true.
Meaningless? Participation in a usefully predictive path is meaning. A different kind of meaning.
And Gemini has a note at the bottom about mistakes, and many people discuss this. Caveat emptor, as usual.
If you read the comments on AI articles at Ars Technica, you will find that they seem to have become the tech bastion of anti-AI sentiment. I'm not sure how it happened, but it seems they found, or fell into, a strong anti-AI niche, and now feed it.
You cannot even see the comments of people who pointed out the flaws in the study, since they are so heavily downvoted.
> It's interesting that there's still such a market for this sort of take.
What do you think the explanation might be for there being "such a market"?
So, you agree with the point that they’re making and you’re mad about it? It’s important to state that the models aren’t doing real reasoning because they are being marketed and sold as if they are.
As for your question: ‘So what does "sophisticated simulators of reasoning-like text" even mean here?’
It means CoT interstitial “reasoning” steps produce text that looks like reasoning, but is just a rough approximation, given that the reasoning often doesn’t line up with the conclusion, or the priors, or reality.
What is "real reasoning"? The mechanism that the models use is well described. They do what they do. What is this article's complaint?
For example: at a minimum, the reasoning should match what actually happened. That's not even a complete set of criteria for reasoning, just a minimal baseline. Currently, LLM programs generate BS in the "reasoning" part of the output. Ask an LLM program to "reason" about how it produces the sum of two numbers and you will see that its explanation doesn't match at all what the program did in the background. The "reasoning" it outputs is simply an extract of the reasoning humans did in the LLM's training data. Even Anthropic officially admits this. If you ask a program how to do maintenance on a gearbox and it replies with a very well-articulated and correct (important!) guide to harvesting wheat, then we can't call it reasoning of any kind, even though the wheat-farming guide was correct and logical.
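To make that concrete: the explanation a model typically writes out is essentially the schoolbook carry procedure below, while Anthropic's interpretability write-ups report that the network's internal computation doesn't follow that verbalised trace. (Toy code, and it only shows the "what it says it does" half; the "what it actually does" half lives in the weights and isn't a few lines of Python.)

```python
# The "how I add two numbers" story a model typically verbalises: the schoolbook
# column-by-column carry procedure. The mismatch described above is that the
# network's actual computation does not follow this trace.

def schoolbook_addition(a: int, b: int) -> int:
    """Add two non-negative integers the way the verbalised 'reasoning' describes,
    printing each column step."""
    da, db = str(a)[::-1], str(b)[::-1]          # least-significant digit first
    carry, out = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        total = x + y + carry
        out.append(str(total % 10))
        print(f"column {i}: {x} + {y} + carry {carry} = {total} "
              f"-> write {total % 10}, carry {total // 10}")
        carry = total // 10
    if carry:
        out.append(str(carry))
    return int("".join(reversed(out)))

print("result:", schoolbook_addition(36, 59))   # prints the column steps, then 95
```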
“the mechanism the models use is well described”
Vs
AI capex in the past 6 months contributed more to US GDP growth than all consumer spending
Or
AGI is coming
Or
AI Agents will be able to do most white collar work
——
The paper is addressing parts of the conversation and expectations of AI that are in the HYPE quadrant. There’s money riding on the idea that AI is going to begin to reason reliably. That it will work as a ghost in the machine.
"the reasoning often doesn’t line up with the conclusion, or the priors, or reality."
My dude, have you ever interacted with human reasoning?
Are you sure you are not comparing to human unreason?
Most of what humans think of as reason is actually "will to power". The capability to use our faculties in a way that produces logical conclusions seems like an evolutionary accident, an off-label use of the brain's machinery for complex social interaction. Most people never learn to catch themselves doing the former when they intended to engage in the latter; some don't know the difference. Fortunately, the latter provides a means of self-correction, and the research here hopes to elucidate whether an LLM-based reasoning system has the same property.
In other words, given consistent application of reason, I would expect a human to eventually draw logically correct conclusions, decline to answer, rephrase the question, etc. But with an LLM, should I expect a non-deterministic infinite walk through plausible nonsense? I expect reasoning to converge.
Not sure why everyone is downvoting you, as I think you raise a good point: these anthropomorphic words like "reasoning" are useful as shorthands for describing patterns of behaviour, and are generally not meant to be direct comparisons to human cognition. But it goes both ways. You can still criticise the model on the grounds that what we call "reasoning" in the context of LLMs doesn't match the patterns we associate with human "reasoning" very well (such as the ability to generalise to novel situations), which is what I think the authors are doing.
""Sam Altman says the perfect AI is “a very tiny model with superhuman reasoning".""
It is being marketed as directly related to human reasoning.
Sure, two things can be true. Personally I completely ignore anything Sam Altman (or other AI company CEOs/marketing teams for that matter) says about LLMs.