Comment by trunch

5 hours ago

Which of the LiveCodeBench Pro and SWE-Bench Verified benchmarks comes closer to everyday coding assistant tasks?

Because it seems to lead by a decent margin on the former and trails behind on the latter

I also work a lot with SWE-bench Verified in my testing. In my opinion, this benchmark is now mainly good for catching regressions on the agent side.

However, above roughly 75%, models are likely all about the same. The remaining instances are likely underspecified despite the effort the authors put into making the benchmark "verified". From what I have seen, these are often cases where the problem statement says to implement X for case Y, but the agent has to simply guess whether to implement the same for another case Y' - and that guess decides whether the instance is won or lost.

Neither :(

LCB Pro is LeetCode-style questions, and SWE-bench Verified is heavily benchmaxxed, very old Python tasks.