Comment by Snuggly73
5 months ago
First-time commenter - I was so triggered by this benchmark that I had to come out of lurking.
I've spent time going over the description and the cases, and it's a misrepresented travesty.
The benchmark takes existing cases from Upwork, reintroduces the problems back into the code, and then asks the LLM to fix them, grading against newly written 'comprehensive tests'.
Let's look at some of the cases:
1. The regex zip code validation problem
Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - the issue was mainly that they were using one common regex to validate across all countries, so the solution had to introduce country-specific regexes etc.
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... - is just that new code with a comma added to two countries' patterns (see the sketch below)....
2. Room showing empty - 14857
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
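To make it concrete, here's roughly what both of those reintroductions boil down to. This is my own illustrative sketch, not the actual patches - every name, regex and comment below is invented:

    // Hypothetical sketch - all names and patterns are invented for illustration,
    // not taken from Expensify/App or the SWELancer benchmark patches.

    // Case 1 style: per-country zip regexes, "broken" by slipping a comma into a
    // couple of patterns so they no longer validate those countries correctly.
    const COUNTRY_ZIP_REGEX: Record<string, RegExp> = {
        US: /^\d{5}(?:-\d{4})?$/,
        DE: /^\d{5,}$/,          // comma inside {5} now accepts any length >= 5
        NL: /^\d{4},[A-Z]{2}$/i, // comma now requires a literal "," inside the code
    };

    function isValidZip(country: string, zip: string): boolean {
        const pattern = COUNTRY_ZIP_REGEX[country];
        return pattern ? pattern.test(zip) : false;
    }

    // Case 2 style: the data source for a room is simply stubbed out, so every room
    // renders empty - complete with a comment announcing the intentional bug.
    function getRoomParticipants(_reportID: string): string[] {
        // Intentionally returning an empty array (the "radical bug")
        return [];
    }

Reverting the stray comma or the stub is about all the "fix" requires - which is the sense in which these are trivialized.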
I could go on and on and on...
The "extensive tests" are also laughable :(
I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.
They’ve now removed your second example from the testing set - I bet they won’t regenerate their benchmarks without this test.
Good sleuthing, seems someone from OpenAI read your comment and found it embarrassing as well!
For future reference, permalink to the original commit with the RADICAL BUG comment: https://github.com/openai/SWELancer-Benchmark/blob/a8fa46d2b...
The new version (as of now) still has a comment making it obvious that there's an intentionally introduced bug, but it's not as on the nose: https://github.com/openai/SWELancer-Benchmark/blob/2a77e3572...
Those were just two examples of widespread problems with the introduced bugs and the tests.
How about this - https://github.com/openai/SWELancer-Benchmark/blob/08b5d3dff... ("Intentionally use raw character count instead of HTML-converted length" - sketched below)
Or this one - https://github.com/openai/SWELancer-Benchmark/blob/08b5d3dff... (the user is complaining about flickering, so the reintroduced bug adds code that causes flickering :) )
Or the one that they list in A.10 of the paper as O1 successfully fixing - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
O1 doesn't actually seem to fix anything (besides dumping arbitrary changes all over the code); the reintroduced bug messes with the state, not with the back-button navigation.
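For the "raw character count" one, I don't know the exact surrounding code, but the gist of that patch comment is roughly this - everything below is a made-up stand-in, not the actual Expensify or benchmark code:

    // Hypothetical sketch - the converter, limit and function names are stand-ins.

    // Stand-in for whatever markdown -> HTML conversion happens before a comment is stored.
    function toHTML(markdown: string): string {
        return markdown.replace(/\*(.+?)\*/g, '<strong>$1</strong>');
    }

    const MAX_COMMENT_LENGTH = 10;

    // One plausible "correct" check: enforce the limit on the converted HTML.
    function isTooLongConverted(comment: string): boolean {
        return toHTML(comment).length > MAX_COMMENT_LENGTH;
    }

    // The trivialized "reintroduced bug": use the raw character count instead of the
    // HTML-converted length, so markup that expands during conversion slips past the check.
    function isTooLongRaw(comment: string): boolean {
        return comment.length > MAX_COMMENT_LENGTH;
    }

    // '*hi*' is 4 raw characters but 19 after conversion, so the two checks disagree.
    console.log(isTooLongConverted('*hi*'), isTooLongRaw('*hi*')); // true false

Spotting and reverting that kind of one-line swap has basically nothing to do with the difficulty, or the payout, of the original Upwork issue.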
Anyways, I went through a sample of 20-30 last night and gave up. No one needs to take my word for it - force-pushing aside, anyone can pull the repo and check for themselves.
Most of the 'bugs' are trivialized to a massive degree, which a) makes them very easy to solve and b) doesn't reflect their original monetary value, which in effect makes the whole premise of 'let's measure how much real money value SWE agents can provide' invalid.
If they wanted to create a real benchmark, they should've found the commits reflecting the state of the app as of the moment of each bug and set up the benchmark around that.
So it's much worse than I assumed from the paper and a quick look at the repo?
For further clarification: 1. See the issue example #14268 https://github.com/openai/SWELancer-Benchmark/tree/08b5d3dff.... It has a patch that is supposed to "reintroduce" the bug into the codebase (note the comments in the patch).
Also, the patch is supposedly applied over commit da2e6688c3f16e8db76d2bcf4b098be5990e8968 - much later than the original fix, but still about a year ago; not sure why, it might be something to do with cut-off dates.
2. Proceed to https://github.com/Expensify/App/issues/14268 to see the actual original issue thread.
3. Here is the actual merged solution at the time: https://github.com/Expensify/App/pull/15501/files#diff-63222... - as you can see, the diff is quite different... Not only that, but the point to which the "bug" was reapplied is so far in the future that the repo had even migrated to TypeScript by then.
---
And they still had to add a whole other level of bullshit with "management" tasks on top of that - guess why =)
Prior "bench" analysis for reference: https://arxiv.org/html/2410.06992v1
(edit: code formatting)
I'm not quite sure what your issue with reintroducing the bugs is? How else do you expect them to build a test suite?
My issue is that it's not the original bug that is being reintroduced (or the original code checked out at that point), but rather a trivialized approximation of how the bug presented itself.