Comment by 1a527dd5

1 day ago

> This feels very much like "we are now moving the goal posts".

It does, and it should. Each iteration that gets closer to the goalposts exposes flaws in the goalposts, and then you try to make better ones. The problem people seem to have with moving goalposts is that they assume the goalpost makers either made good goalposts or believed they had, when the actual process is "do the best we can at the moment and update when we get better information".

> But this is the good kind of goalpost moving.

  • Only if you didn't read the article.

    They're saying they need to move on from it because the benchmark is flawed (without offering proof), and that's why they can't hit 100%.

    It's not an "our models are so good that the benchmark is too easy" situation.

    • I feel like they're quite open about why they think the benchmark doesn't work anymore:

      > We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

      > This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

    • How can you say “without bringing in proof” when there is literally proof in the article?