
Comment by AIPedant

5 days ago

It almost certainly is specialized to IMO problems; look at the way it is answering the questions: https://xcancel.com/alexwei_/status/1946477742855532918

E.g. here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig

Frankly, it looks to me like it's using an AlphaProof-style system, going back and forth between natural language and Lean or the like. Of course OpenAI will not tell us any of this.

Nope

https://x.com/polynoamial/status/1946478249187377206?s=46&t=...

  • If you don't have a Twitter account, then x.com links are useless; use a mirror: https://xcancel.com/polynoamial/status/1946478249187377206

    Anyway, that doesn't refute my point; it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific", but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.

    • We can only go off their word, unfortunately, and they say no formal math was used, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this: geometry in Lean is not that well developed, so unless they also built their own system it would be hard to do this formally (though their P2 proof is a coordinate bash, i.e. computation by algebra instead of geometric construction, so it's hard to tell; a toy example of a coordinate bash follows below).
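      To unpack the term for readers outside competition math: a coordinate bash replaces synthetic construction with plain algebra. A toy example (illustrative, not from the thread): to prove that the diagonals of a parallelogram bisect each other, place the vertices at A = (0, 0), B = (b, 0), D = (d1, d2), and C = B + D = (b + d1, d2). The midpoint of AC is ((b + d1)/2, d2/2), and the midpoint of BD is ((b + d1)/2, d2/2) as well; the two midpoints coincide, so each diagonal passes through the midpoint of the other, with no geometric construction needed.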


    • You don't need specialized tooling like Lean if you have enough training data with statements written in natural language, I suppose. But the use of AlphaProof/AlphaGeometry-style learning is almost certain (a rough sketch of what that kind of loop looks like follows below). And I'm sure they have spent a lot of compute to produce solutions; $10k is not a problem for them.

      The bigger question is: why should everyone be excited by this if they don't plan to share anything related to this AI model back with humanity?
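      For context on what "AlphaProof/AlphaGeometry-style learning" typically means, here is a minimal expert-iteration sketch: sample candidate solutions, keep the ones a checker accepts, and retrain on the successes. Everything below is an illustrative assumption (the `sample`, `check`, and `finetune` interfaces are invented), not a description of OpenAI's or DeepMind's actual systems.

      ```python
      # Hypothetical expert-iteration sketch in the AlphaProof/AlphaZero style;
      # sample(), check(), and finetune() are assumed interfaces, not real APIs.
      def expert_iteration(model, problems, sample, check, finetune, rounds: int = 5):
          for _ in range(rounds):
              accepted = []
              for p in problems:
                  for attempt in sample(model, p, k=32):  # spend compute searching
                      if check(p, attempt):               # a checker accepts the attempt
                          accepted.append((p, attempt))
                          break
              model = finetune(model, accepted)           # retrain on verified successes
          return model
      ```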

I actually think this “cheating” is fine. In fact it’s preferable. I don’t need an AI that can act as a really expensive calculator or solver; we’ve already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency. I see no real value in avoiding the optimal or near-optimal techniques we’ve already devised; the harder reasoning tasks are choosing tools, instrumenting them properly, interpreting their results, and iterating (a minimal sketch of such a loop follows below). That is the missing piece in automated reasoning, after all. A neural network that approximates those tools at great cost is a parlor trick: interesting, but neither useful nor practical. Even if they have some agent system here, it doesn’t make the achievement any less impressive that a machine can zero-shot perform as well as top humans on incredibly difficult reasoning problems posed in natural language.
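To make the agent loop described above concrete, here is a minimal sketch of "choose a tool, run it, interpret the result, iterate". The tool set and the `model.decide()` interface are invented for illustration; nothing here reflects whatever system OpenAI actually used.

```python
# Hypothetical sketch of a tool-choosing agent loop. The Tool type and the
# model.decide() interface are illustrative assumptions, not a real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # an exact solver, e.g. a CAS or a SAT solver

def solve(problem: str, model, tools: dict[str, Tool], max_steps: int = 10) -> str:
    transcript = f"Problem: {problem}"
    for _ in range(max_steps):
        # The model does the abductive work: pick a tool and formulate its input.
        action = model.decide(transcript, tools)  # e.g. {"tool": "cas", "input": "..."}
        if action["tool"] == "final_answer":
            return action["input"]
        result = tools[action["tool"]].run(action["input"])  # exact computation
        # Interpret the result and iterate on it.
        transcript += f"\n{action['tool']}({action['input']!r}) -> {result}"
    return "no answer within step budget"
```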

  • > I actually think this “cheating” is fine. In fact it’s preferable.

    The thing with the IMO is that the solutions are already known by someone.

    So suppose the model got the solutions beforehand and they were fed into its training data. Would that be an acceptable level of "cheating" in your view?

    • Surely you jest. The cheating would be the same as in any other situation: someone inside the IMO slipping the questions and answers to people outside, and those then being used to compete. Fine, but why? If this were discovered it would be disastrous for everyone involved, and for what? A noteworthy HN link? The downside would be international scandal and destroyed careers; the upside is imperceptible.

      Finally, even if you trained the model on the answers, the weight shift in such an enormous model would be inconsequential; you would need to prime the context or boost the weights. All this seems like absurd lengths to go to just to cheat on this one thing, rather than focusing your energies on actually improving model performance. The payout for OpenAI isn’t a gold medal in the IMO; it’s having a model that can get a gold medal at the IMO and then selling it. But it has to actually be capable of doing what’s on the tin, otherwise their customers will easily and rapidly discover this.

      Sorry, I like tin foil as much as anyone else, but this doesn’t seem credible given the incentive structure.


Why "almost certainly"? The link you provided has this to say:

> 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
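For readers unfamiliar with the phrase, the simplest form of test-time compute scaling is best-of-n sampling against a verifier: spend more compute by drawing more candidate solutions, then keep the one the verifier likes best. The sketch below is a hedged illustration under that assumption; `generate` and `score` are hypothetical stand-ins, and OpenAI has not described its actual setup.

```python
# Hypothetical best-of-n sketch of test-time compute scaling with a learned
# verifier; generate() and score() are assumed interfaces, not a known API.
def best_of_n(problem: str, generate, score, n: int = 64) -> str:
    # More test-time compute simply means more sampled candidate proofs.
    candidates = [generate(problem) for _ in range(n)]
    # A learned verifier scores each natural-language proof; no formal system
    # such as Lean is involved, consistent with what the thread reports.
    return max(candidates, key=lambda proof: score(problem, proof))
```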

  • Also from the thread:

    > 8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model.

    And from Sam Altman:

    > we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models.

    The wording you quoted is very tricky: the method used to create the model is generalizable, but the model is not a general-use model.

    If I have a post-training method that allows a model to excel at a narrow task, it's still a generalizable method if there's a wide range of narrow tasks it works on.

Since this looks like a geometric proof, I wonder whether the AI operates only on logical/mathematical statements or whether it actually somehow 'visualizes' the proof the way a human would while solving it.