Comment by veselin
7 hours ago
I work a lot on testing also SWE bench verified. This benchmark in my opinion now is good to catch if you got some regression on the agent side.
However, going above 75%, it is likely about the same. The remaining instances are likely underspecified despite the effort of the authors that made the benchmark "verified". From what I have seen, these are often cases where the problem statement says implement X for Y, but the agent has to simply guess whether to implement the same for other case Y' - which leads to losing or winning an instance.
No comments yet
Contribute on Hacker News ↗