Whether a problem is "good" or "bad" is not always objective or simple.
For example, a problem can be underspecified, with tests hardcoded to one particular solution out of several possible ones. If your solution works fine but uses a different function name than the one the tests hardcode, you can unfairly score 0.
When an eval contains underspecified problems like these, you can still score 100% if you remember the original solution from your training data, or if your taste happens to match that of the original human authors. Both of those qualities, good memory and good taste, are genuinely valuable, but they are rewarded unfairly relative to a model that did exactly what it was asked, just in a different way than the hardcoded tests expected.
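A minimal sketch of the failure mode described above. All names here (the module, the functions, the path-cleaning task) are invented for illustration; they do not come from any real benchmark:

```python
import types

def make_submission(fn_name):
    """Build a fake 'repo module' whose fix is exposed under fn_name."""
    mod = types.ModuleType("mylib")
    def normalize(p):
        # The actual fix: collapse '//' and '/./' segments in a path.
        parts = [s for s in p.split("/") if s not in ("", ".")]
        return "/".join(parts)
    setattr(mod, fn_name, normalize)
    return mod

def hidden_test(mod):
    """The benchmark's test patch hardcodes the original author's name."""
    try:
        assert mod.normalize_path("a//b/./c") == "a/b/c"
        return 1  # pass
    except AttributeError:
        return 0  # behavior is correct, but the name differs: scored 0

print(hidden_test(make_submission("normalize_path")))  # 1
print(hidden_test(make_submission("clean_path")))      # 0
```

Both submissions implement the identical, correct fix; only the second one chose a different (equally reasonable) function name, and the hardcoded test zeroes it out.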
To some extent, yes.
It is not impossible to solve in absolute terms, in the sense that all the necessary information is present in the repo plus the problem statement.
But it is impossible in a practical sense: unless you have read the ground truth, you cannot solve it the way the test patch demands.
It is simply not plausible to me that a model can read the problem statement so precisely that it nails, 100%, exactly what the test suite is trying to test.