Comment by codingwagie

6 days ago

I'm seeing big advances that aren't shown in the benchmarks: I can simply build software now that I couldn't build before. The level of complexity that I can manage and deliver is higher.

A really important thing is the distinction between performance and utility.

Performance can improve linearly while utility jumps in steps. For some people and tasks, performance may have been improving all along, yet the model remains "interesting but pointless" until it crosses some threshold, and then suddenly you can do things with it.

Yeah, I kind of feel like I'm not moving as fast as I did, because the complexity and the feature set keep growing: constant scope creep, precisely because I can move faster.

I am finding that my ability to use it for coding aligns almost perfectly with increasing token memory (i.e., larger context windows).

Yeah, the benchmarks are just a proxy. o3 was a step change where I started to really be able to build stuff I couldn't before.

Mind giving some examples?

  • Not OP, but a couple of days ago I managed to vibecode my way through a small app that pulled data from a few services and ran a few validation checks. By itself it's not very impressive, but my input was literally "this is what the responses from endpoints A, B, and C look like. This field included somewhere in A must appear somewhere in the response from B, and the response from C must feature this and that from responses A and B. If the responses include links, check that they exist." To my surprise, it generated everything in one go. No retries, no Agent-mode churn. In the not-so-distant past this would have required progressing through smaller steps, and I had to fill in tests to nudge Agent mode not to mess up. Not today.
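
    For concreteness, here is a minimal sketch of the kind of cross-service check being described; the endpoint URLs and field names are hypothetical stand-ins, since the comment doesn't name them:

        import requests

        # Pull data from the three (hypothetical) services.
        a = requests.get("https://service-a.example/api/item/42").json()
        b = requests.get("https://service-b.example/api/item/42").json()
        c = requests.get("https://service-c.example/api/item/42").json()

        # A field included in response A must appear somewhere in response B.
        assert a["order_id"] in str(b), "order_id from A not found in B"

        # Response C must feature values taken from responses A and B.
        assert c["source_order_id"] == a["order_id"], "C does not reference A"
        assert c["status"] == b["status"], "C does not match B's status"

        # If the responses include links, check that they exist.
        for resp in (a, b, c):
            for url in resp.get("links", []):
                assert requests.head(url, allow_redirects=True).ok, f"dead link: {url}"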

Okay, but this has everything to do with the tooling and nothing to do with the models.

  • I mostly disagree with this.

    I have been using 'aider' as my go-to coding tool for over a year. It basically works the same way it always has: you specify all the context, give it a request, and that goes to the model without much massaging (roughly the pattern sketched at the end of this comment).

    I can see a massive improvement in results with each new model that arrives. I can do so much more with Gemini 2.5 or Claude 4 than I could with earlier models, and the tool has not really changed at all.

    I will agree that for the casual user, the tools make a big difference. But if you took today's tooling and paired it with a model from last year, it would go in circles.
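
    To make that concrete, here is a minimal sketch of that pattern (not aider's actual implementation): read the files you name, drop them into a prompt, and send the whole thing to the model. The OpenAI client and the "gpt-4o" model name are just one possible stand-in:

        from pathlib import Path
        from openai import OpenAI

        def ask(files, request, model="gpt-4o"):
            # "Specify all the context": dump the named files verbatim into the prompt.
            context = "\n\n".join(f"--- {p} ---\n{Path(p).read_text()}" for p in files)
            prompt = f"{context}\n\nRequest: {request}"
            # Hand it to the model with no further massaging.
            client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

        print(ask(["app.py", "tests/test_app.py"], "Add retry logic to the HTTP client."))

    Swapping in a different model changes nothing else in that loop, which is the point: the harness is thin, so the difference in results comes from the model.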

  • Can you explain why?

    • You can write whole projects with LLMs thanks to tools that can analyze your local project's context; those tools didn't exist a year ago.

      You could use Cursor, Windsurf, Q CLI, Claude Code, or whatever else with Claude 3 or even an older model, and you'd still get usable results.

      It's not the models that have enabled "vibe coding"; it's the tools.

      Further evidence: new model releases focus more and more on coding, while other fields have barely benefited from the supposed model improvements. That wouldn't be the case if the gains really came from the models rather than the tooling.
