Comment by cjsaltlake
19 hours ago
If you read the Mythos report, in which they discuss and account for contamination at length, it still suggests that performance on SWE-bench Verified is meaningful. Benchmarks, including SWE-bench, can absolutely be gamed, but if you're not explicitly benchmaxxing, improving on SWE-bench still measures model improvements, at least up to the level of Mythos.