Comment by euleriancon

12 hours ago

Are we looking at the same data? On that site I see that Opus 4.7's and GPT 5.5's g scores are within each other's confidence intervals, and both are significantly ahead of the number 3 model.

Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.

Edit: I looked at the data more, and the two models are roughly equal only when looking at the mean across all the tests. GPT 5.5 significantly outperforms Opus 4.7 in coding, while Opus 4.7 significantly outperforms in "decision making." I'm not seeing details on what decision making explicitly means.

Decision making refers to environments where the LLM is called on every tick (like games with social communication); examples here: https://gertlabs.com/spectate.
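
Roughly, each of those environments runs a loop like the sketch below. The env and call_llm names are just illustrative stand-ins, not the actual harness; the point is that the model is queried on every tick rather than once per task:

    # Hypothetical sketch of a per-tick decision-making eval loop.
    # `env` and `call_llm` are stand-ins, not the real harness.
    def run_episode(env, call_llm, max_ticks=1000):
        obs = env.reset()
        for _tick in range(max_ticks):
            action = call_llm(obs)        # the model is called on every tick
            obs, done = env.step(action)
            if done:
                break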

Because GPT 5.5 just launched, and those games take longer to accumulate data for, it doesn't have enough samples yet. It will end up with a wider lead over Opus, I am sure. Coding evals always have large sample sizes on day one. Good find; we should probably adjust the weighting for decision games with low match counts.
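
Something like empirical-Bayes shrinkage would do it: pull each model's per-environment score toward the global mean in proportion to its match count, so low-sample environments count for less. This is just a sketch with a made-up prior_strength constant, not what's deployed:

    # Hypothetical sketch: shrink a score toward the global mean when the
    # match count is low. `prior_strength` is an arbitrary illustrative value.
    def shrunk_score(mean_score, n_matches, global_mean, prior_strength=30.0):
        # weight approaches 1 as matches accumulate, 0 with little data
        weight = n_matches / (n_matches + prior_strength)
        return weight * mean_score + (1.0 - weight) * global_mean

With that, a model with 5 matches mostly reflects the global mean, while one with 500 matches reflects its own measured score.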

Right, I'm mixing my own observations in with what the leaderboard is showing. It could be confirmation bias, but I use both Opus and GPT extensively, and since GPT 5.4 I've noticed that Opus doesn't even begin to touch GPT's level of technical depth. I was hoping Opus 4.7 would close that gap, but unfortunately it doesn't even compare to GPT 5.4 in that respect.

I'm not being a hater; I love Opus for different reasons, but I can't rely on its technical ability.