Comment by nsoonhui

3 hours ago

I really have to take your score with a grain of salt because Opus 4.5 does better than Opus 4.6.

They're within confidence intervals of each other, but remember how much discussion there was that Opus 4.6 had been nerfed in March. We averaged samples over the entire lifetime of Opus 4.6, which likely served many different underlying checkpoints. Even the best version of Opus 4.6 was hardly an upgrade.

We find a lot of anomalies in our benchmark.