
Comment by MattDaEskimo

5 days ago

Yes, the original announcement for o3 and o4-mini:

https://openai.com/index/introducing-o3-and-o4-mini/

o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.

o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.

Then, the new announcement:

https://help.openai.com/en/articles/6825453-chatgpt-release-...

o3 scored 90 on AIME 2024 and 81 on GPQA.

o4-mini wasn't measured.
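
To make the gap concrete, here's a minimal sketch (plain Python, numbers copied from the two pages above; since the second page didn't list o4-mini, only o3 is compared) that just tallies the reported drops:

    # Figures as quoted above: (launch post, release-notes page)
    o3_scores = {
        "AIME 2024": (91.6, 90.0),
        "GPQA": (83.3, 81.0),
    }

    # Print each benchmark's old score, new score, and the delta
    for bench, (old, new) in o3_scores.items():
        print(f"o3 {bench}: {old} -> {new} ({new - old:+.1f})")

Running it prints the two deltas in question: -1.6 on AIME 2024 and -2.3 on GPQA.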

---

Codeforces is the same story: there's a footnote that they're using a different problem set due to saturation, but they still provide no baseline model to ground the comparison.

The first post measures o3 at high reasoning effort; the second measures it at medium. It's the same model, then and now.