Comment by embedding-shape
1 day ago
I feel like they're quite open about why they think the benchmark doesn't work anymore:
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.