Comment by zamadatix
6 hours ago
A new benchmark comes out, it's designed so nothing does well at it, the models max it out, and the cycle repeats. This could describe either massive growth in LLM coding abilities or a disconnect between what the new benchmarks measure and why new models eventually score well on them. Under the former assumption scores can keep growing without limit... but there also isn't much actual growth (if any at all). Under the latter the score growth makes sense, but the reality of using the tools doesn't suggest they've actually gotten >10x better at writing code for me in the last year.
Whether an individual human could do well across all tasks in a benchmark is probably not the right question to ask a benchmark to answer. It's quite easy to construct benchmark tasks that a human can't do well on, and where you don't even need AI to do better.
Your mileage may vary, but for me, working today with the latest version of Claude Code on a non-trivial Python web dev project, I absolutely feel that I can hand the AI coding tasks that are 10 times more complex or time-consuming than what I could hand to Copilot or Windsurf a year ago. It's still nowhere close to replacing me, but I feel that I can work at a significantly higher level.
What field are you in where you feel that there might not have been any growth in capabilities at all?
EDIT: Typo
Claude 3.5 came out in June of last year, and it is imo only marginally worse than the AI models currently available for coding. I do not think models are 10x better than they were a year ago; that seems extremely hyperbolic, or you are working in a super niche area where it is true.
Are you using it for agentic tasks of any length? 3.5 and 4.5 are about the same for single-file/single-snippet tasks, but my observation has been that 4.5 can do longer, more complex tasks that weren't even worth trying with 3.5 because it would always fail.
I'm in product management focused on networking. I can use the tools to create great mockups in a fraction of the time, but the actual turnaround of those into production-ready code has not changed much. The main gain on getting code written is that the team has been able to build test cases and pipelines a bit more quickly.