Comment by gertlabs
1 hour ago
They're within confidence intervals of each other, but remember how much discussion there was that Opus 4.6 had been nerfed in March. We averaged samples over the entire lifetime of Opus 4.6, during which many different underlying checkpoints were likely served. Even the best version of Opus 4.6 was hardly an upgrade.
We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.
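For readers unfamiliar with the "within confidence intervals" claim, here is a minimal sketch of that kind of check. It assumes a normal-approximation 95% interval on the mean score; the scores, the names `mean_ci` and `intervals_overlap`, and the method itself are illustrative assumptions, not the benchmark's actual data or procedure.

```python
import math
from statistics import mean, stdev

def mean_ci(samples, z=1.96):
    """Return (mean, low, high) using a normal-approximation 95% interval."""
    m = mean(samples)
    half = z * stdev(samples) / math.sqrt(len(samples))
    return m, m - half, m + half

def intervals_overlap(a, b):
    """True when two (mean, low, high) intervals share at least one point."""
    return a[1] <= b[2] and b[1] <= a[2]

# Hypothetical per-run scores, invented purely for illustration.
model_a = [0.61, 0.58, 0.64, 0.60, 0.59, 0.63]
model_b = [0.62, 0.60, 0.66, 0.61, 0.63, 0.65]

ci_a = mean_ci(model_a)
ci_b = mean_ci(model_b)
print(intervals_overlap(ci_a, ci_b))
```

Overlapping intervals only suggest the gap is not clearly distinguishable at this sample size. Pooling samples from several checkpoints, as described above, would also widen the spread and could blur a real difference between individual versions.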