> Any link / source / anything? You've got quite an opportunity here: an OpenAI employee claiming there's no difference, and you have something that shows there is.
Yes, the original announcement for o3 and o4-mini:
https://openai.com/index/introducing-o3-and-o4-mini/
o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.
o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.
Then the new announcement:
https://help.openai.com/en/articles/6825453-chatgpt-release-...
o3 scored 90 on AIME 2024 and 81 on GPQA.
o4-mini wasn't measured.
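To make the gap concrete, here's a quick sketch of the o3 deltas between the two posts (scores copied from the links above; the variable names are mine):

    # o3 scores as reported in the two linked posts
    launch = {"AIME 2024": 91.6, "GPQA": 83.3}  # o3/o4-mini launch announcement
    notes  = {"AIME 2024": 90.0, "GPQA": 81.0}  # ChatGPT release notes

    for bench, old in launch.items():
        new = notes[bench]
        print(f"{bench}: {old} -> {new} ({new - old:+.1f})")

That's a 1.6-point drop on AIME 2024 and a 2.3-point drop on GPQA between the two write-ups.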
---
> Codeforces is the same, but they have a footnote that they're using a different dataset due to saturation, but still have no grounding model to compare with
The first post measures o3 at high reasoning effort; the second measures it at medium reasoning effort. It's the same model, then and now.