Comment by dwohnitmok

1 day ago

Is everyone just glossing over the first place score of 118/120 on the Putnam?! I mean we'll see how it does on the upcoming 2025 test, but that's insane!

We've seen absolutely ridiculous progress in model capability over the past year (which is also quite terrifying).

For one thing, it's not a real score; they judged the results themselves and Putnam judges are notoriously tough. There was not a single 8 on the problem they claim partial credit for (or any partial credit above a 2) amongst the top 500 humans. https://kskedlaya.org/putnam-archive/putnam2024stats.html.

For another thing, the 2024 Putnam problems are in their RL data.

Also, it's very unclear how performance on these competitions, whose problems are designed to have clear-cut answers and to be solvable by (well-prepared) humans in an hour, will translate to anything else.

  • What do other models trained on the same problems score? What about if they are RL'd to not reproduce things word for word?

Why do you think that the 2024 Putnam problems that they used to test were in the training data?

    /? "Art of Problem Solving" Putnam https://www.google.com/search?q=%22Art+of+Problem+Solving%22...

    From p.3 of the PDF:

    > Curating Cold Start RL Data: We constructed our initial training data through the following process:

    > 1. We crawled problems from Art of Problem Solving (AoPS) contests, prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs, totaling 17,503 problems.

I think serious math research progress should come in 1-2 years. It basically only depends on how hard informal verification is: training data shouldn't be a problem, and if informal verification is easy you can throw RL compute at it until it improves.
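
To make that concrete, here's a minimal sketch of the kind of loop I mean (my own illustration, not the paper's setup; `policy` and `grade_proof` are hypothetical placeholders for the model being trained and an informal LLM-judge verifier):

```python
import random

def grade_proof(problem: str, proof: str) -> float:
    """Informal verifier stub: in practice, an LLM-as-judge call
    returning a score in [0, 1] for how convincing the proof is."""
    return random.random()  # placeholder

def rl_iteration(policy, problems, samples_per_problem=8, threshold=0.9):
    """One rejection-sampling-style iteration: sample candidate proofs,
    keep only the ones the verifier rates highly, and return them as
    fine-tuning data for the next round."""
    accepted = []
    for problem in problems:
        candidates = [policy(problem) for _ in range(samples_per_problem)]
        for proof in candidates:
            if grade_proof(problem, proof) >= threshold:
                accepted.append((problem, proof))
    return accepted  # train on these, then repeat with the updated policy
```

If the verifier is cheap and reliable enough, this is just a compute knob you keep turning; if it isn't, the whole thing stalls on judging quality.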

  • LLMs are already a powerful tool for serious math researchers, just not at the level of "fire and forget", where they would completely replace mathematicians.

Also, on the IMO-ProofBench Basic benchmark the model achieved nearly 99% accuracy, though it fell slightly behind Gemini Deep Think on the Advanced subset.

The approach shifts from "result-oriented" to "process-oriented" verification, which is particularly important for theorem proving, where rigorous step-by-step derivation matters more than just the numerical answer.
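
As a rough illustration of that distinction (my own sketch, assuming a hypothetical per-step `judge` callable; nothing here comes from the paper): a result-oriented reward only checks the final answer, while a process-oriented reward scores every derivation step, so a proof with one broken step loses credit even when the answer happens to be right.

```python
def result_oriented_reward(final_answer: str, reference: str) -> float:
    """Reward only the final boxed answer: 1.0 on exact match, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_oriented_reward(steps: list[str], judge) -> float:
    """Average per-step validity scores from an informal judge model,
    so a single invalid step drags the whole proof's reward down."""
    if not steps:
        return 0.0
    return sum(judge(step) for step in steps) / len(steps)
```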