Comment by killerstorm
6 days ago
Bullshit. We have absolute numbers, not just vibes.
The top of the SWE-bench Verified leaderboard was at around 20% in mid-2024, i.e. AI was failing at most tasks.
Now it's at 70%.
It's objectively better at tackling typical development tasks.
And it's not like it went from 2% to 7%: that would be the same relative jump, but here the absolute gain is 50 percentage points.
Isn't SWE-bench based on public GitHub issues? Couldn't the increase in performance also be explained by models continuing to train on newer scraped GitHub data, i.e. effectively training on the test set?
The pressure on AI companies to release a new SOTA model is real, as the technology rapidly becomes commoditised. I think people have good reason to be skeptical of these benchmark results.
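For what it's worth, the contamination concern is checkable in principle: SWE-bench instances carry the creation date of the underlying GitHub issue, so you can count how many tasks predate a given model's training cutoff. A rough sketch, assuming the created_at field in the Hugging Face copy of the dataset (the cutoff date below is made up for illustration, not any vendor's real date):

    # Count how many SWE-bench Verified tasks were created before a
    # hypothetical training cutoff and could show up in training scrapes.
    from datasets import load_dataset

    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

    CUTOFF = "2024-01-01"  # made-up training cutoff, ISO 8601

    # created_at is an ISO-8601 timestamp, so string comparison works
    stale = sum(1 for ex in ds if ex["created_at"] < CUTOFF)
    print(f"{stale}/{len(ds)} instances predate the cutoff")

A task predating the cutoff only shows it could have been seen in training, not that it was; but at least it puts a number on the exposure.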
That sounds like a conspiracy theory. If it were just some mysterious benchmark and nothing else, then sure, you'd have reason to be skeptical.
But there are plenty of people who have actually tried LLMs on real work and swear they work now. Do you think they're all lying?
Many of them have good reputations; they're not just noobs.