> Any link / source / anything? You've got quite an opportunity here: an OpenAI employee claiming there's no difference, and you have something that shows there is.
Yes, the original announcement for o3 and o4-mini:
https://openai.com/index/introducing-o3-and-o4-mini/
o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.
o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.
Then the new announcement:
https://help.openai.com/en/articles/6825453-chatgpt-release-...
o3 scored 90 on AIME 2024 and 81 on GPQA.
o4-mini wasn't measured.
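To make the gap concrete, here's a quick sketch of the o3 deltas between the two posts (scores copied from the links above; the variable names are mine):

    # o3 scores as reported in the two linked posts
    launch = {"AIME 2024": 91.6, "GPQA": 83.3}  # o3/o4-mini launch announcement
    notes  = {"AIME 2024": 90.0, "GPQA": 81.0}  # ChatGPT release notes

    for bench, old in launch.items():
        new = notes[bench]
        print(f"{bench}: {old} -> {new} ({new - old:+.1f})")

That's a 1.6-point drop on AIME 2024 and a 2.3-point drop on GPQA between the two write-ups.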
---
> Codeforces is the same, but they have a footnote that they're using a different dataset due to saturation, but still have no grounding model to compare with
The first post measures o3 at high reasoning effort; the second measures it at medium reasoning effort. It's the same model, then and now.