Comment by comeonbro
5 months ago
Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)
Notably missing: o3
Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png
That's a good point. Assuming they're strategic about releasing this benchmark, they likely already evaluated o3 on it and saw that it performs favorably. Perhaps they're holding off until they have a chance to tune it further, so they can release a strong improvement and generate additional buzz a bit later on.
Although I wouldn't bet against o3, I think it works in their favor to release it later regardless of how well it is doing.
Case 1, it does worse than or on par with o1: that would be shocking and not a great sign for their test-time compute approach, at least in this domain. Obviously they would not want to release those results.
Case 2, it does slightly better than o1: I think "holding off until they have a chance to tune it further" applies.
Case 3, it does much better than o1: they get to release the results after another model has made a noticeable improvement on the benchmark, earning another good press release to keep hype high, and they still get to tune it further before publishing.
Altman stated they won't release o3 by itself; they plan to release it as part of GPT-5, which will incorporate all model subtypes: reasoning, image, video, voice, etc.