Comment by manmal
4 hours ago
Looks like it will be on par with the contenders when it comes to coding. I guess improvements will be incremental from here on out.
> I guess improvements will be incremental from here on out.
What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark that any single human - it's unlikely that any particular human dev is knowledgeable to tackle the full range of diverse tasks even in the smaller SWE-Bench Verified within a reasonable time frame; to the best of my knowledge, no one has tried that.
Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.
A new benchmark comes out, it's designed so that nothing does well on it, the models max it out, and the cycle repeats. This could describe either massive growth in LLM coding abilities or a disconnect between what the new benchmarks measure and why new models score well on them after enough time. Under the former assumption there is no limit to the growth of scores... but there is also not very much actual growth (if any at all). Under the latter the score growth matches the cycle, but the reality of using the tools doesn't suggest they've actually gotten >10x better at writing code for me in the last year.
Whether an individual human could do well across all tasks in a benchmark is probably not the right question to ask of a benchmark. It's quite easy to construct benchmark tasks that a human can't do well on, and you don't even need AI to do better.
Your mileage may vary, but for me, working today with the latest version of Claude Code on a non-trivial Python web dev project, I absolutely feel that I can hand the AI coding tasks that are 10 times more complex or time-consuming than what I could hand to Copilot or Windsurf a year ago. It's still nowhere close to replacing me, but I feel that I can work at a significantly higher level.
What field are you in where you feel that there might not have been any growth in capabilities at all?
EDIT: Typo
Google has had a lot of time to optimise for those benchmarks, and just barely made SOTA (or not even SOTA) now. How is that not incremental?
If we're being completely honest, a benchmark is like an honest exam: any set of questions can only be used once, when it first comes out. Otherwise you're only testing how well people can acquire and memorize the exact questions.
If it’s on par in code quality, it would be a way better model for coding because of its huge context window.
Sonnet can also work with 1M context. Its extreme speed is the only thing Gemini has on the others.
Can it do that in Claude Code and Claude Desktop now? When I was using it a couple of months ago, it seemed like only the API had 1M.