Comment by gizmodo59

5 hours ago

GPT-5.3 Codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes it with 77.3% on Terminal Bench. The shortest-lived lead ever, gone in less than 35 minutes. What a time to be alive!

Dumb question. Can these benchmarks be trusted when model performance tends to vary depending on the hour and the load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time? Or even, are the models at their best at launch and then slowly eroded to more economical settings after the hype wears off?

  • We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.

    (I'm from OpenAI.)

    • Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than Google in most cases. Often it is on point, but at times it seems off, inaccurate or shallow maybe. In some cases I just end the session.

    • I appreciate you taking the time to respond to these kinds of questions the last few days.

    • Can you be more specific than this? Does it vary over time, from the launch of a model through the next few months, beyond tinkering and optimization?

  • It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how they respond to and succeed on the tasks, and use that information to figure out how to improve their own models. If the benchmark numbers aren't real, competitors will call out that they aren't reproducible.

    However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.

    • > I'd expect the numbers are all real.

      I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).

      It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (e.g. harnesses). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital, all with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.

      And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased models aren't the same people as the ops teams managing global deployment and load balancing at scale day-to-day. If there aren't significant ongoing resources devoted specifically to validating that those two things remain in sync, they'll almost certainly drift apart, and it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs come in a little lower than expected, no one gets a late-night incident alert. This isn't even a dig at OpenAI in particular; it's just the default state of how large orgs work.

  • On benchmarks, GPT 5.2 was roughly equivalent to Opus 4.5, but most people who've used both for SWE work would say that Opus 4.5 is/was noticeably better.

    • There's an extended thinking mode for GPT 5.2; I forget the name of it right at this minute. It's super slow: a 3-minute Opus 4.5 prompt takes circa 12 minutes to complete in 5.2 in that mode. But it is not a close race in terms of results; GPT 5.2 wins by a handy margin there. It's just too slow to be usable interactively, though.

    • I mostly used Sonnet/Opus 4.x over the past months, but 5.2 Codex seemed on par or better for my use case this past month. I've tried other models here and there and always went back to Claude, but with 5.2 Codex, for the first time, I felt it was very competitive, if not better.

      Curious to see how things will be with 5.3 and 4.6.

    • I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development.

  • At the end of the day you test it on your own use cases anyway, but benchmarks are a great initial hint as to whether a model is worth trying out.
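
    A minimal sketch of what that spot check can look like, assuming the official openai Python client; the prompts, expected substrings and model name below are placeholders, not anyone's real eval setup:

        # pip install openai
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        # Hypothetical personal tasks: a prompt plus a substring the answer should contain.
        TASKS = [
            ("Write a Python one-liner that reverses a string.", "[::-1]"),
            ("What flag makes grep case-insensitive? Reply with the flag only.", "-i"),
        ]

        def pass_rate(model: str) -> float:
            passed = 0
            for prompt, expected in TASKS:
                reply = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content
                passed += expected in reply
            return passed / len(TASKS)

        print(pass_rate("gpt-5.3-codex"))  # placeholder model name

    Crude substring checks obviously miss a lot, but a dozen of these tailored to your own work says more than a headline number.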

  • When do you think we should run this benchmark? Friday 1pm? Monday 8am? Wednesday 11am?

    I definitely suspect all these models are being degraded during heavy loads.

    • This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance.
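
      If you do log your own numbers across the day, keep small-sample noise in mind before reading anything into a dip; a few points of swing on a small task set is often just variance. A quick illustration with made-up counts, standard library only:

          # Two-proportion z-test: did the off-peak run really beat the peak-hours run?
          from math import erf, sqrt

          def two_prop_z(hits1, n1, hits2, n2):
              p1, p2 = hits1 / n1, hits2 / n2
              pooled = (hits1 + hits2) / (n1 + n2)
              se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
              z = (p1 - p2) / se
              p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
              return z, p_value

          # e.g. 62/80 tasks passed at 2am vs 55/80 at 2pm (made-up numbers)
          print(two_prop_z(62, 80, 55, 80))  # ~ (1.25, 0.21): not significant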

  • We know OpenAI has already been caught obtaining benchmark data and tuning their models to it. So the answer is a hard no. I imagine that over time these benchmarks give a general view of the landscape and of improvements, but take them with a large grain of salt.

    • The same thing happened with Meta researchers and Llama 4, and it shows what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]

      You always have to question these benchmarks, especially when in-house researchers could game them if they wanted to.

      Which is why evaluation must be independent.

      [0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...

That's a massive jump. I'm curious whether there's a materially different feel to how it works, or whether we're starting to reach the point of benchmark saturation. If the benchmark is good, then 10 points should be a big improvement in capability...
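
One way to frame the size of that jump, taking the 77.3% above and assuming the previous best really was about 10 points lower (an assumption, not a quoted number): what you feel in long agentic runs is the failure rate, and it drops by roughly a third.

    prev_score, new_score = 0.673, 0.773                 # assumed previous best vs the reported 77.3%
    fail_prev, fail_new = 1 - prev_score, 1 - new_score  # failure rates: ~32.7% vs ~22.7%
    print(f"relative drop in failures: {1 - fail_new / fail_prev:.0%}")  # -> 31%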