Comment by modeless
5 hours ago
It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.
Isn't the best eval the one you build yourself, for your own use cases and the value you're actually trying to produce?
I encourage people to try. You can even timebox it and start with a few simple cases that might initially look insufficient, but that discomfort is actually a sign there's something there. It's very similar to going from having no unit/integration tests for design or regression to having them.
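For what it's worth, a personal eval doesn't need to be elaborate. Here's a minimal Python sketch of what a timeboxed harness could look like; the `run_model` callable and the example cases are placeholders for whatever model API and tasks you actually care about, not anything from the benchmarks discussed above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable

# Illustrative cases only; replace with tasks pulled from your own work.
CASES = [
    Case("Write a SQL query that returns the top 5 customers by revenue.",
         lambda out: "order by" in out.lower() and "limit 5" in out.lower()),
    Case("Rename the variable `tmp` to `invoice_total` in: tmp = a + b",
         lambda out: "invoice_total" in out),
]

def run_eval(run_model: Callable[[str], str]) -> float:
    """Run every case through `run_model` and return the pass rate."""
    passed = 0
    for case in CASES:
        output = run_model(case.prompt)
        ok = case.check(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.prompt[:60]}")
    return passed / len(CASES)

if __name__ == "__main__":
    # Stub model so the script runs standalone; swap in a real API call.
    score = run_eval(lambda prompt: "SELECT ... ORDER BY revenue DESC LIMIT 5")
    print(f"pass rate: {score:.0%}")
```

Even a dozen checks like this, run against both models, tells you more about your own work than two disjoint benchmark tables.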
I usually wait to see what ArtificialAnalysis says for a direct comparison.
It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!
I also wasn't that familiar with it, but the Opus 4.6 announcement leaned pretty heavily on the Terminal-Bench 2.0 score to quantify how much of an improvement it was for coding, so it looks pretty bad for Anthropic that OpenAI beat them on that specific benchmark so soundly.
Looking at the Opus model card I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.
No way! Must be a coinkydink, no way OpenAI knew ahead of time that Anthropic was gonna put a focus on that specific useless benchmark as opposed to all the other useless benchmarks!?
I'm firing 10 people now instead of 5!