Comment by parasubvert
5 days ago
I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.
Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.
This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.
Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.
If you're not familiar with System 1 / System 2, it's googlable.
> These models cannot reason
Not trying to be a smarty pants here, but what do we mean by "reason"?
Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.
It does things for me such as coding up new features, looking at the compile and runtime responses, and then correcting the code. All while I sit here and write with you on HN.
It gives me feedback like "lock free message passing is going to work better here" and then replaces the locks with exactly the kind of thing I actually want. If it runs into a problem, it does what I would have done a few weeks ago: it sees that some flag is set wrong, or that some architectural decision needs to be changed, and then implements the changes.
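To give a concrete flavour of the kind of change it proposed, here's a toy sketch in Go (not my actual code; all the names are made up): a mutex-guarded queue swapped for channel-based message passing.

```go
package main

import "fmt"

// A toy event type standing in for whatever the real queue carried.
type event struct{ id int }

// Before (roughly): a slice guarded by a sync.Mutex that every goroutine
// locked around. After: a channel, so producers and consumers pass
// messages instead of sharing locked state.
func producer(out chan<- event, n int) {
	for i := 0; i < n; i++ {
		out <- event{id: i} // no explicit lock; the channel handles synchronization
	}
	close(out)
}

func consumer(in <-chan event, done chan<- int) {
	count := 0
	for range in {
		count++ // process each event as it arrives
	}
	done <- count
}

func main() {
	events := make(chan event, 64) // buffered channel acting as the message queue
	done := make(chan int)

	go producer(events, 1000)
	go consumer(events, done)

	fmt.Println("processed:", <-done)
}
```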
What is not reasoning about this? Last year at this time, if I had looked at my code after a two-hour gap and found that someone had pushed edits that compiled, with real improvements, I would have had no doubt that a reasoning, intelligent person who had spent years learning how this worked was behind them.
Is it pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?
I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.
I think the biggest hint that the models aren't reasoning is that they can't explain their reasoning. Researchers have shown, for example, that how a model solves a simple math problem and how it claims to have solved it after the fact have no real correlation. In other words, there was only the appearance of reasoning.
People can't explain their reasoning either. People do a parallel construction of logical arguments for a conclusion they already reached intuitively, with no clue how it happened: "the idea just popped into my head while showering." To our credit, if this post-hoc rationalization fails, we are able to change our opinion to some degree.
Is this true though? I've suggested things that it pushed back on. Feels very much like a dev. It doesn't just dumbly do what I tell it.
> I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.
I’m using Opus 4 for coding and there is no way that model demonstrates any reasoning or “intelligence”, in my opinion. I’ve been through the having-conversations phase, etc., but it doesn’t get you very far; better to read a book.
I use these models to help me type less now, that’s it. My prompts basically tell it to not do anything fancy and that works well.
> It will do something brilliant and another 5 dumb things in the same prompt.
it me
YOU are reasoning.
You raise a fair point. These criticisms based on "it's merely X" or "it's not really Y" don't hold water when X and Y are poorly defined.
The only thing that should matter is the results they get. And I have a hard time understanding why the thing that is supposed to behave in an intelligent way but often just spews nonsense gets 10x budget increases over and over again.
This is bad software. It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output is garbage software. Sinking 10x, 100x, 1000x more resources into it is irrational.
Nothing else matters. Maybe it reasons, maybe it's intelligent. If it produces garbled nonsense often, giving the teams behind it 10x the compute is insane.
"Software that sometimes works and very often produces wrong or nonsensical output" can be extremely valuable when coupled with a way to test whether the result is correct.
> It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output...
Is that very unlike humans?
You seem to be comparing LLMs to much less sophisticated deterministic programs. And claiming LLMs are garbage because they are stochastic.
Which entirely misses the point because I don't want an LLM to render a spreadsheet for me in a fully reproducible fashion.
No, I expect an LLM to understand my intent, reason about it, wield those smaller deterministic tools on my behalf and sometimes even be creative when coming up with a solution, and if that doesn't work, dream up some other method and try again.
If _that_ is the goal, then some amount of randomness in the output is not a bug, it's a necessary feature!
You're right, they should never have given more resources and compute to the OpenAI team after the disaster called GPT-2, which only knew how to spew nonsense.
We already have highly advanced deterministic software. The value lies in the abductive “reasoning” and natural language processing.
We deal with non-determinism any time our code interacts with the natural world. We build guard rails, detection, classification of true/false positives and negatives, and all of that, all the time. This isn’t a flaw, it’s just the way things are for certain classes of problems and solutions.
It’s not bad software - it’s software that does things we’ve been trying to do for nearly a hundred years, beyond any reasonable expectation. The fact that I can tell a machine in human language to do some relatively abstract and complex task, and it pretty reliably “understands” me and my intent, “understands” its tools and capabilities, and “reasons” how to bridge my words to a real-world action, is not bad software. It’s science fiction.
The “pretty reliably” part is where the non-determinism shows up. It’s not perfect, although on a retry with a new seed it often succeeds. This feels like most software that interacts with natural processes in any way or form.
It’s remarkable that anyone who has ever implemented exponential backoff and retry, or handled edge cases, can sit and say “nothing else matters” when they make their living dealing with non-determinism. The algorithmic kernel of logic is 1% of programming and systems engineering; the other 99% is coping with the non-determinism in computing systems.
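For what it’s worth, the guard rail is the same boring one we already use everywhere. A bare-bones sketch (callModel and looksValid are hypothetical stand-ins, not any real API): retry the non-deterministic call with exponential backoff until its output passes a cheap check.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callModel stands in for any non-deterministic call: an LLM, a flaky
// service, a sensor. Here it fails on the first two attempts just to
// exercise the retry path.
func callModel(prompt string, attempt int) (string, error) {
	if attempt < 2 {
		return "", errors.New("garbled output")
	}
	return "ok: " + prompt, nil
}

// looksValid is the cheap acceptance check that turns "sometimes wrong"
// into "usable": reject and retry instead of trusting blindly.
func looksValid(out string) bool {
	return len(out) > 0
}

// withBackoff wraps the unreliable call in the usual guard rail:
// exponential backoff between attempts, give up after a budget.
func withBackoff(prompt string, maxAttempts int) (string, error) {
	delay := 100 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		out, err := callModel(prompt, attempt) // effectively a new "seed" each try
		if err == nil && looksValid(out) {
			return out, nil
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
	return "", errors.New("gave up after max attempts")
}

func main() {
	out, err := withBackoff("summarize the build failure", 5)
	fmt.Println(out, err)
}
```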
The technology is immature and the toolchains are almost farcically basic - money is pouring into model training because we have not yet hit a wall with brute force. And it takes longer to build a new way of programming and designing highly reliable systems in the face of non-determinism, but it’s getting better faster than almost any technology change in my 35 years in the industry.
Your statement that it “very often produces wrong or nonsensical output” also tells me you’re holding onto a bias from prior experiences. The rate of improvement is astonishing. At this point in my professional use of frontier LLMs and techniques, they exceed the precision and recall of humans, and there’s a lot of rich ground untouched. We can already offload massive amounts of decision-making (classification) work that humans used to do, keeping humans as a last line to exercise executive judgement, often with the assistance of LLMs. I expect within two years humans will only be needed in the most exceptional of situations, and we will do a better job on more tasks than we ever could have dreamed of with humans. For the company I’m at, this is a huge bottom-line improvement, far beyond the cost of our AI infrastructure and development, and we do quite a lot of that too.
If you’re not seeing it yet, I wouldn’t use that to extrapolate to the world at large and especially not to the future.
>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result
This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.
> This is rampant human chauvinism
What in the accelerationist hell?
There's nothing accelerationist about recognising that making unfalsifiable statements about LLMs lacking intelligence or reasoning ability serves zero purpose except stroking the speaker's ego. Such people are never willing to give clear criteria for what would constitute proof of machine reasoning for them, which shows their belief isn't based on science or reason.
I’ve used these AI tools for multiple hours a day for months. Not seeing the reasoning part, honestly. I see the heuristics part.
I guess your work doesn't involve any maths then, because then you'd see they're capable of solving maths problems that require a non-trivial amount of reasoning steps.