Comment by dcre
2 days ago
The other commenter is more articulate, but you simply cannot draw the conclusion from this paper that reasoning models don't work well. They trained tiny little models and showed that those don't work. Big surprise! Meanwhile, every other piece of evidence available shows that reasoning models are more reliable on sophisticated problems. Just a few examples:
- https://arcprize.org/leaderboard
- https://aider.chat/docs/leaderboards/
- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
Surely the IMO problems weren't "within the bounds" of Gemini's training data.
The Gemini IMO result used a model specifically fine-tuned for math.
Certainly they weren't training on the unreleased problems. Defining "out of distribution" gets tricky.
>The Gemini IMO result used a model specifically fine-tuned for math.
This is false.
https://x.com/YiTayML/status/1947350087941951596
This is false even for the OpenAI model:
https://x.com/polynoamial/status/1946478250974200272
"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."
Every human taking that exam has fine-tuned for math, specifically on IMO problems.