Comment by janalsncm
2 days ago
> Trying out o3 Pro made me realize that models today are so good in isolation, we’re running out of simple tests.
Are Towers of Hanoi not a simple test? Or chess? A recursive algorithm that runs on my phone can outclass enormous models that cost billions to train.
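For reference, the recursive algorithm I mean is the textbook one; a minimal sketch in Python:

    def hanoi(n, src="A", dst="C", aux="B"):
        # Prints the 2**n - 1 moves that solve an n-disk Towers of Hanoi.
        if n == 0:
            return
        hanoi(n - 1, src, aux, dst)            # move the n-1 smaller disks out of the way
        print(f"Move disk {n} from {src} to {dst}")
        hanoi(n - 1, aux, dst, src)            # stack them back on top of the largest disk

    hanoi(3)   # 7 moves; hanoi(20) would print 1,048,575 moves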
A reasoning model should be able to reason about things. I am glad models are better and more useful than before but for an author to say they can’t even evaluate o3 makes me question their credibility.
https://machinelearning.apple.com/research/illusion-of-think...
AGI means the system can reason through any problem logically, even if it’s less efficient than other methods.
This isn't my own framing (I saw it in a YouTube video, but I agree with it) -- LLMs are not calculators. It's as simple as that.
If the LLM can complete the task using tools, then it's a pass.
Apple's team went out of their way to select tests that LLMs would struggle with, then took away their tools -- and then had the audacity to write that they're surprised at the outcome. Who would be surprised? Nobody who has used AI since GPT-4 expects these models to be calculators or algorithm executors.
You want the LLM to be smart enough to realize "I can't do this without tools," grab the right tool, use it correctly, and give you the actual correct answer. If you prevent LLMs from using tools or from writing and executing code, you're intentionally crippling them.
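To sketch what that flow looks like (purely illustrative -- the tool name, registry, and call format here are made up, not any vendor's actual API):

    # Hypothetical tool: an ordinary function the harness will run on the model's behalf.
    def solve_hanoi(n: int) -> list[str]:
        moves = []
        def rec(k, src, dst, aux):
            if k:
                rec(k - 1, src, aux, dst)
                moves.append(f"disk {k}: {src} -> {dst}")
                rec(k - 1, aux, dst, src)
        rec(n, "A", "C", "B")
        return moves

    TOOLS = {"solve_hanoi": solve_hanoi}

    # The model's job is to notice it shouldn't enumerate 2**20 - 1 moves token by token,
    # and to emit a call like this instead; the harness then executes it exactly.
    call = {"name": "solve_hanoi", "arguments": {"n": 20}}
    result = TOOLS[call["name"]](**call["arguments"])
    print(len(result))   # 1048575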
I think that’s perfectly reasonable for problems that have already been solved and for which tools already exist. But there are a lot of problems for which tools don’t exist and will need to be developed.
In other words, being able to go to the produce aisle means I don’t need to know how to farm, but it also doesn’t make me a farmer.
The Towers of Hanoi one is kind of weird: the prompt asks for a complete move-by-move solution, and at 15 or 20 disks (where reasoning models fail) the result is unreasonably long and very repetitive. Likely as not it's just running into some training or sampler quirk that discourages the model from dumping huge amounts of low-entropy text.
I don't have a Claude in front of me -- if you just give it the algorithm to produce the answer and ask it to give you the huge output for n=20, will it even do that?
If I have to give it the algorithm as well as the problem, we're no longer even pretending to be in AGI territory. If it falls down interpreting an algorithm, it is worse than even a Python interpreter.
Towers of Hanoi is a well-known toy problem. The algorithm is definitely in any LLM’s training data. So it doesn’t even need to come up with a new algorithm.
There may be some technical reason it's failing, but the more fundamental reason is that an autoregressive statistical token generator isn't suited to solving problems with symbolic solutions.
I'm just saying that ~10MB of short, repetitive text lines might be out of scope as a response the LLM driver is willing to give at all, regardless of how it's derived.
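Rough numbers, assuming one short line of text per move:

    n = 20
    moves = 2**n - 1                                                # 1,048,575 moves for 20 disks
    print(moves * len("12: A -> C\n") / 1e6)                        # ~11.5 MB in a terse per-move format
    print(moves * len("Move disk 12 from peg A to peg C\n") / 1e6)  # ~35 MB written out verbosely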
I doubt I could reliably solve Towers of Hanoi in my head for more than 3 or 4 discs.
Fair point, but the idea of these “reasoning” models is that they have a scratchpad to figure it out before giving an answer.
You are the only person suggesting that o3 is AGI or even an approach to AGI. They’re different beasts entirely.
It single-shots the Towers of Hanoi: https://chatgpt.com/share/6848fff7-0080-8013-a032-e18c999dc3...
It’s not correct.
At move 95 the disks are:
Tower 1: 10, 9, 8, 5, 4, 3, 2, 1
Tower 2: 7
Tower 3: 6
It attempts to move disk 6 from tower 2 to tower 3, but disk 6 is already at tower 3, and moving 7 on top of 6 would be illegal.
In fact this demonstrates that o3 is unable to implement a simple recursive algorithm.
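For what it's worth, you don't have to eyeball transcripts like this; a small replay script catches it (the (src, dst) move format here is an assumption -- adapt it to whatever the model actually emits):

    def validate(n, moves):
        # Replay (src, dst) peg moves for an n-disk puzzle and report the first illegal one.
        pegs = {1: list(range(n, 0, -1)), 2: [], 3: []}   # each list is bottom-to-top
        for i, (src, dst) in enumerate(moves, start=1):
            if not pegs[src]:
                return f"move {i}: peg {src} is empty"
            disk = pegs[src][-1]
            if pegs[dst] and pegs[dst][-1] < disk:
                return f"move {i}: disk {disk} placed on smaller disk {pegs[dst][-1]}"
            pegs[dst].append(pegs[src].pop())
        return "all moves legal" if pegs[3] == list(range(n, 0, -1)) else "legal moves, but puzzle not solved"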
I find it amusingly ironic that one comment under yours points out that there's a mistake in the model output, while the other comment under yours trusts that it's correct but says it isn't "real reasoning" anyway because it knows the algorithm. There's probably something to be said here about moving goalposts.
If both criteria A and B need to be satisfied for something to be true, it's not moving the goalposts for one person to point out that A is not true and for another to point out that B is not true.
This isn't reasoning at all. It's applying a well-known algorithm to a problem. It literally says "classic" in its response.
It is "reasoning" in the same way that a calculator or compiler is reasoning. But I checked the solution, and it's actually wrong, so it's a moot point.