Comment by nerevarthelame
1 day ago
It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.
Of the metrics they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, and SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
Their CyberGym score is reportedly awful because of the cybersecurity nerfing. https://x.com/i/status/2060046843023630841
Gonna assume it's because those scores barely budged or moved downward, and most of their reported benchmark results are probably within sampling error...
They will release a system card, and you can then confirm or disconfirm your assumptions.