Comment by dylanbyte
5 days ago
These are high-school level only in the sense of assumed background knowledge; they are extremely difficult.
Professional mathematicians would not get this level of performance unless they have an IMO background themselves.
This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.
The answers are not in the training data.
This is not a model specialized to IMO problems.
Are you sure this is not specialized to IMO? I do see the Twitter thread saying it's "general reasoning", but I'd imagine they RL'd on olympiad math questions? If not, I really hope someone from OpenAI says so, because it would be pretty astounding.
They also said this is not part of GPT-5, and "will be released later". It's very, very likely a model specifically fine-tuned for this benchmark, where afterwards they'll evaluate which actual real-world problems it's good at (e.g., "use o4-mini-high for coding").
Humans who excel at IMO questions are also "fine-tuned" on them, in the sense that they practice them for hundreds of hours.
From my vague remembrance of doing data science years ago, it's very hard not to leak information between your training and validation sets.
Basically, the way you do RL is that you make a set of training examples of input-output pairs, and you set aside a smaller validation set, which you never train on, to check whether your model is doing well.
What you do is tweak the architecture and the training set until the model does well on the validation set. By doing so, you inadvertently leak information about the validation set. Perhaps you choose an architecture that does well on the validation set. Perhaps you train more on examples like the ones being validated.
Even without explicit intent to cheat, it's very hard to avoid this contamination: if you chose a different validation set, you'd end up with a different model.
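A toy sketch of that leakage (a hypothetical scikit-learn setup, nothing to do with OpenAI's actual pipeline): the validation set is never trained on, but because we keep whichever model scores best on it, the final model still depends on it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: label depends on feature 0 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_model, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:   # the "tweak the architecture" loop
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)      # selection uses the validation set...
    if score > best_score:                 # ...so the chosen model depends on it
        best_model, best_score = model, score

# best_score is now an optimistic estimate: a different validation
# split would have selected a different model.
print(best_score)
```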
The questions were published a few days ago. The 2025 IMO just ended.
And the model was in lockdown to avoid this.
>> This is not a model specialized to IMO problems.
How do you know?
Yeah, looking at the GP... say a sequence of things that are true and plausible, then add your strong, unsupported claim at the end. I remember that approach from when I studied persuasion techniques...
> The answers are not in the training data.
> This is not a model specialized to IMO problems.
Any proof?
There's no proof that this is not made up, let alone any shred of transparency or reproducibility.
There are trillions of dollars at stake in hyping up these products; I take everything these companies write with a cartload of salt.
No, and they're lying about the most important claim: that this is not a model specialized to IMO problems.
From the thread:
> just to be clear: the IMO gold LLM is an experimental research model.
The thread tried to muddy the narrative by saying the methodology can generalize, but no one is claiming the actual model is a generalized model.
There'd be a massively different conversation if a generalized model that could become the next iteration of ChatGPT had achieved this level of performance.
It almost certainly is specialized to IMO problems; look at the way it answers the questions: https://xcancel.com/alexwei_/status/1946477742855532918
E.g. here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig
Frankly, it looks to me like it's using an AlphaProof-style system, going between natural language and Lean/etc. Of course, OpenAI will not tell us any of this.
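For readers who haven't seen Lean: an AlphaProof-style system translates natural-language mathematics into machine-checkable formal statements, something like this toy example (my illustration, not from the model's output):

```lean
-- Toy example of a formally checkable statement (far below IMO level):
-- "addition of natural numbers is commutative".
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```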
OpenAI explicitly stated that it is natural language only, with no tools such as Lean.
https://x.com/alexwei_/status/1946477745627934979?s=46&t=Hov...
Nope
https://x.com/polynoamial/status/1946478249187377206?s=46&t=...
If you don't have a Twitter account, then x.com links are useless; use a mirror: https://xcancel.com/polynoamial/status/1946478249187377206
Anyway, that doesn't refute my point; it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific", but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
I actually think this "cheating" is fine. In fact, it's preferable. I don't need an AI that can act as a really expensive calculator or solver. We've already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency.

I see no real value in avoiding the optimal or near-optimal techniques we've already devised rather than focusing on the harder reasoning tasks of choosing tools, instrumenting them properly, interpreting their results, and iterating. That is the missing piece in automated reasoning, after all. A neural network that approximates those tools at great cost is a parlor trick: interesting, but neither useful nor practical. Even if they have some agent system here, it doesn't make the achievement any less impressive that a machine can zero-shot do as well as top humans on incredibly difficult reasoning problems posed in natural language.
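As a rough sketch of the loop being described (every name here is a hypothetical placeholder; this is not a claim about how OpenAI's system works):

```python
# Minimal tool-using agent loop. call_llm is a stub standing in for a
# real language-model call that would choose the next action.
def call_llm(prompt: str) -> dict:
    # Placeholder: a real system would query a model and parse its reply.
    return {"tool": "done", "answer": "(stub)"}

TOOLS = {
    # The near-optimal solvers we already have; eval is restricted so
    # this toy calculator can't execute arbitrary code.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def agent(problem: str, max_steps: int = 10) -> str:
    transcript = problem
    for _ in range(max_steps):
        # The hard, abductive part: choose a tool and how to invoke it.
        action = call_llm(f"Pick the next tool and input for:\n{transcript}")
        if action["tool"] == "done":
            return action["answer"]
        # Delegate the mechanical work, then interpret the result and iterate.
        result = TOOLS[action["tool"]](action["input"])
        transcript += f"\n{action['tool']}({action['input']}) -> {result}"
    return transcript
```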
> I actually think this “cheating” is fine. In fact it’s preferable.
The thing with the IMO is that the solutions are already known by someone.
So suppose the model got the solutions beforehand and they were fed into training. Would that be an acceptable level of "cheating" in your view?
Why "almost certainly"? The link you provided has this to say:
> 5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
Also from the thread:
> 8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model.
And from Sam Altman:
> we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models.
The wording you quoted is very tricky: the method used to create the model is generalizable, but the model is not a general-use model.
If I have a post-training method that allows a model to excel at a narrow task, it's still a generalizable method if there's a wide range of narrow tasks it works on.
Since this looks like a geometric proof, I wonder whether the AI operates only on logical/mathematical statements or whether it somehow 'visualizes' the proof the way a human would while solving it.
No, I assure you, more than 50% of working mathematicians will not consistently score at gold level on the IMO (I'm in the field). As the original parent said, pretty much only people who had the training in high school can. Number theorists without that training might be able to do some number-theory IMO questions, but this level is basically impossible without specialized training (with maybe a few exceptions for very strong mathematicians).
> No, I assure you, more than 50% of working mathematicians will not consistently score at gold level on the IMO (I'm in the field)
I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem within the span of one hour. I would argue that most working mathematicians, given an arbitrary IMO problem and a week to work on it, would solve it. As for "gold level": with IMO problems, you either solve one or you don't.
You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI's model solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side; we are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at multiple words per second.
I sense we may just have different experiences of our colleagues' skill sets, as I can think of five people I could send some questions to, and I know they would do them just fine. In fact, we have often done similar problems on a free afternoon, and I often do the same on flights as a way to pass the time and improve my focus (my issue isn't my talent for or understanding of maths, it's my ability to concentrate).

I don't disagree that some level of training is needed, but these questions aren't unique, nor impossible, especially as that training material exists and LLMs can access those examples. LLMs also have brute force, which is a significant help with this type of problem. One particular point: of all the STEM topics, math is probably the best documented, alongside CS.
I am a professor in a math department (I teach statistics, but there is a good complement of actual math PhDs), and only about 10% care about these types of problems; definitely fewer than half could get gold on an IMO test, regardless of whether they care.
They are all outstanding mathematicians, but IMO-type questions are not something mathematicians can universally solve without preparation.
There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.
100% agree with this.
My second degree is in mathematics. Not only can I probably not do these, but they likely aren't useful to my work, so I don't actually care.
I'm not sure an LLM could replace the mathematical side of my work (modelling), mostly because it's applied: people don't know what they're asking for, what's possible, or how to do it, and all the problems turn out to be quite simple, really.
> They are all outstanding mathematicians, but IMO-type questions are not something mathematicians can universally solve without preparation.
So the IMO is basically the LeetCode of mathematics.
So IMO questions are to math what LeetCode is to programming?
I see this distinction a lot, but what is the fundamental difference between competition "math" and professional/research math? If people actually knew, then young students (and their parents) could decide for themselves whether they wanted to engage in either kind of study.
Getting gold at the IMO is pretty damn hard.
I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.
I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.
I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.
The trouble is, getting an IMO gold medal is much easier (by frequency) than being the #1 Go player in the world, which was achieved by AI 10 years ago. I'm not sure it's enough to just gesture at the task; drilling down into precisely how it was achieved feels important.
(Not to take away from the result, which I'm really impressed by!)
6 replies →
IMO questions are to math as LeetCode questions are to software engineering: not necessarily easier or harder, but they test ability on different axes. There's definitely some overlap with undergrad-level proof-style questions, but I disagree that being a working mathematician would necessarily mean you can solve these types of questions quickly. I did a PhD in pure math (and undergrad, obviously), and I know I'd have to spend time revising and then practicing to even begin answering most IMO questions.
This is probably the right time to bring up this classic:
"Did you win the Putnam?"
https://news.ycombinator.com/item?id=35079