Comment by johnecheck
5 days ago
Wow. That's an impressive result, but how did they do it?
Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means would have been breathtaking science fiction a few years ago - now it’s not exciting? Regardless of the tools for verification or even solvers - why are the goalposts moving so fast? There is no bonus for “purity of essence” and using only neural networks. We live in an era where it’s hard to tell whether machines are thinking or not, which since the first computing machines has been seen as the ultimate achievement. Now we pooh-pooh the results of each iteration - which now unfold month over month, not decade over decade.
You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques was used. In fact - you should be excited if we’ve started to break out of the limitation of forcing NNs to be load-bearing in literally everything. That’s a sign of a maturing technology, not of its limitations.
>> Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means would have been breathtaking science fiction a few years ago - now it’s not exciting?
Half the internet is convinced that LLMs are a big-data cheating machine, and if they're right, then yes, boldly cheating where nobody has cheated before is not that exciting.
I don't get it, how do you "big data cheat" an AI into solving previously unencountered problems? Wouldn't that just be engineering?
Without sharing their methodology, how can we trust the claim? Questions like:
1) Did humans formalize the input?
2) Did humans prompt the LLM towards the solution?
etc.
I am excited to hear about it, but I remain skeptical.
>Why is that less exciting?
Because if I have to throw 10000 rocks to get one in the bucket, I am not as good/useful of a rock-into-bucket-thrower as someone who gets it in one shot.
People would probably not be as excited about the prospect of employing me to throw rocks for them.
If you don't have an automatic way to verify the solution, then picking the correct answer out of 10,000 is more impressive than coming up with some answer in the first place. If AI tech can effectively prune a tree search without an evaluator, that would be a super big leap, but I doubt they achieved this.
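For what it's worth, the standard verifier-free trick on short-answer benchmarks is self-consistency: sample many independent solutions and majority-vote on the final answer. Proofs are harder precisely because there's no short answer to vote over, which is the point above. A minimal sketch of just the voting step, with generate and extract_answer as hypothetical stand-ins for a real model API and answer parser (not anything OpenAI has described):

    from collections import Counter
    from typing import Callable

    def self_consistency(generate: Callable[[str], str],
                         extract_answer: Callable[[str], str],
                         prompt: str,
                         n: int = 16) -> str:
        """Sample n independent solutions and return the most common final answer.

        generate produces one full solution per call; extract_answer pulls the
        final answer out of a solution (e.g. whatever follows "Answer:"). Both
        are placeholders; this only illustrates the voting step.
        """
        answers = [extract_answer(generate(prompt)) for _ in range(n)]
        # Majority vote; ties go to whichever answer was seen first.
        return Counter(answers).most_common(1)[0][0]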
It’s exciting because nearly all humans have a 0% chance of throwing the rock into the bucket, and most people believed a rock-into-bucket-thrower machine was impossible. So even an inefficient rock-into-bucket-thrower is impressive.
But the bar has been getting raised very rapidly. What was impressive six months ago is awful and unexciting today.
I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason; they pattern-match language tokens and generate emergent behaviour as a result.
Certainly the emergent behaviour is exciting, but we tend to jump to conclusions as to what it implies.
This means we are far more trusting of software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs, parameters, and testing to act correctly - System 2 thinking.
Now with NNs it's inverted: it's a brilliant know-it-all, but it bullshits a lot and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking, with questionable but evolving System 2 skills whose limits we don't know.
If you're not familiar with System 1 / System 2, it's googleable.
> These models cannot reason
Not trying to be a smarty pants here, but what do we mean by "reason"?
Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.
It does things for me such as coding up new features, looking at the compile and runtime responses, and then correcting the code. All while I sit here and write with you on HN.
It gives me feedback like "lock-free message passing is going to work better here" and then replaces the locks with the exact kind of thing I actually want. If it runs into a problem, it does what I did a few weeks ago: it will see that some flag is set wrong, or that some architectural decision needs to be changed, and then implement the changes.
What is not reasoning about this? Last year at this time, if I had looked at my code after a two-hour delta and someone had pushed edits that compiled, with real improvements, I would have had no doubt that there was a reasoning, intelligent person who had spent years learning how this worked.
Is it pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?
I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.
>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason; they pattern-match language tokens and generate emergent behaviour as a result
This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.
Because the usefulness of an AI model is reliably solving a problem, not being able to solve a problem given 10,000 tries.
Claude Code is still only a mildly useful tool because it's horrific beyond a certain breadth of scope. If I asked it to solve the same problem 10,000 times, I'm sure I'd get a great answer to significantly more difficult problems, but that doesn't help me, as I'm not capable of scaling myself to check 10,000 answers.
>if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.
That entirely depends on who did the cherry-picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry-picking, then this is just akin to a human solving a hard problem: attempting solutions and falsifying them until the desired result is achieved. The difference is just that the LLM scales with compute, while humans operate only sequentially.
The key bit here is whether the LLM doing the cherry-picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.
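For concreteness, "the LLM does the cherry-picking" usually means something like best-of-N sampling with a grader model that only ever sees the problem statement and a candidate proof, never a reference solution. A rough sketch under that assumption; generate and grade are hypothetical placeholders, not anything OpenAI has published:

    from typing import Callable, List, Tuple

    def best_of_n(generate: Callable[[str], str],
                  grade: Callable[[str, str], float],
                  problem: str,
                  n: int = 10_000) -> Tuple[str, float]:
        """Generate n candidate proofs and return the one the grader scores highest.

        generate(problem) returns one candidate proof; grade(problem, proof)
        returns a score judging rigor/correctness. The property being debated
        above: the grader sees only the problem and the candidate, no solution.
        """
        candidates: List[str] = [generate(problem) for _ in range(n)]
        scored = [(proof, grade(problem, proof)) for proof in candidates]
        return max(scored, key=lambda pair: pair[1])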
Mark Chen posted that the system was locked before the contest. [1] It would obviously be crazy cheating to give verifiers a solution to the problem!
[1] https://x.com/markchen90/status/1946573740986257614?s=46&t=H...
> If it didn't
We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.
I don't think it's much less exciting if they ran it 10000 times in parallel? It implies an ability to discern when a proof is correct and rigorous (which o3 can't do consistently), and it also means that outputting the full proof is within the model's capabilities, even if rarely.
The whole point of RL is that if you can get it to work 0.01% of the time, you can get it to work 100% of the time.
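Concretely, that maps onto plain rejection sampling / expert iteration: sample a lot, keep only the attempts an automatic checker accepts, and fine-tune on those so the rare behavior becomes the default. A rough sketch of the data-collection step only; generate and is_correct are assumed placeholders, not a description of what OpenAI actually did:

    from typing import Callable, List, Tuple

    def rejection_sample(generate: Callable[[str], str],
                         is_correct: Callable[[str, str], bool],
                         problems: List[str],
                         samples_per_problem: int = 64) -> List[Tuple[str, str]]:
        """Keep only (problem, attempt) pairs that pass the checker.

        Fine-tuning on the returned pairs is how a success rate that starts
        near 0.01% can be distilled into the model's default behavior.
        """
        kept: List[Tuple[str, str]] = []
        for problem in problems:
            for _ in range(samples_per_problem):
                attempt = generate(problem)
                if is_correct(problem, attempt):
                    kept.append((problem, attempt))
                    break  # one verified solution per problem is enough here
        return kept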
> if OpenAI ran this 10000 times in parallel and cherry-picked the best one
This is almost certainly the case - remember the initial o3 ARC benchmark? I could add that this is probably a multi-agent system as well, so the context-length restriction can be bypassed.
Overall, AI being good at math problems isn't news to me. It was already better than 99.99% of humans; now it's better than 99.999% of us. So ... ?
> what tools were used and how the model used them
According to the twitter thread, the model was not given access to tools.