Comment by kasey_junk
3 days ago
Fwiw, I run this eval every week on a set of known prompts, and I believe the in-group differences are bigger than the out-group ones.
That is, I get more variance between Opus 4.6 and itself across runs than I do between the SOTA models.
I don't have the budget for statistical significance, but I'm convinced that people claiming broad differences are just vibing, or that there are times when agent features make a big difference.
It may be the agent features in my case. Now that I think about it, I also forgot that my CLAUDE.md is different from my AGENTS.md.
Either way, all one can really rely on is the benchmarks, and those are easily cheated or overfitted to.
I think it's all very hard to quantify, so take my previous comment with a massive rock of salt.
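For what it's worth, the comparison I mean is roughly this: variance of one model against itself across repeated runs vs variance across a set of models run once each. A minimal sketch, with entirely made-up scores (not my actual eval data):

```python
from statistics import pvariance

# Hypothetical pass rates on a fixed prompt set -- illustrative numbers only.
same_model_runs = [0.81, 0.74, 0.88, 0.79, 0.85]  # one model, five repeated runs
across_models = [0.82, 0.80, 0.84, 0.79, 0.83]    # five different models, one run each

within = pvariance(same_model_runs)   # run-to-run variance of a single model
between = pvariance(across_models)    # variance across the different models

print(f"within-model variance:  {within:.4f}")
print(f"between-model variance: {between:.4f}")
print("in-group > out-group:", within > between)
```

With numbers like these, the within-model variance dominates, which is the shape of what I'm seeing; with only a handful of runs, though, neither estimate means much, hence the budget caveat above.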