Comment by comeonbro
5 months ago
Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)
Notably missing: o3
Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png
That's a good point. Assuming they're strategic about releasing this benchmark, they likely already evaluated o3 on it and saw that it performs favorably. Perhaps they're holding off until they have a chance to tune it further, so they can release a strong improvement and generate additional buzz a bit later on.
Although I wouldn't bet against o3, I think it works in their favor to release it later regardless of how well it is doing.
Case 1, it does worse than or on par with o1: that would be shocking and not a great sign for their test-time compute approach, at least in this domain. Obviously they would not want to release those results.
Case 2, it does slightly better than o1: I think "holding off until they have a chance to tune it further" applies.
Case 3, it does much better than o1: they get to release the results after another model has made a noticeable improvement on the benchmark, earning another good press release to keep hype high, and they still get to tune it further before publishing.
Altman stated they won't release o3 by itself; they plan to release it as part of GPT-5, which will incorporate all model subtypes: reasoning, image, video, voice, etc.