Comment by ehtbanton
4 days ago
Benchmarks like this one are designed to thoroughly test the model across several iterations. 15% is a MASSIVE discrepancy.
Come on, Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we all just feel short-changed.