
Comment by agentultra

7 days ago

I don’t think people are good at self-reporting the “boost” it gives them.

We need more empirical evidence. And historically we’re really bad at running such studies and they’re usually incredibly expensive. And the people with the money aren’t interested in engineering. They generally have other motives for allowing FUD and hype about productivity to spread.

Personally I don’t see these tools going much further than where they are now. They choke on anything that isn’t a greenfield project and consistently produce unwanted results. I don’t know what magic incantations and combinations of agents people have got set up, but if that’s what they call “engineering” these days, I’m not sure that word has any meaning anymore.

Maybe these tools will get there one day but don’t go holding your breath.

> They choke on anything that isn’t a greenfield project and consistently produce unwanted results.

That was true 8 months ago. It's not true today, because of the one-two punch of modern longer-context "reasoning" models (Claude 4+, GPT-5+) and terminal-based coding agents (Claude Code, Codex CLI).

Setting those loose on an existing large project is a very different experience from previous LLM tools.

I've watched Claude Code use grep to find potential candidates for a change I want to make, then read the related code, follow back the chain of function calls, track down the relevant tests, make a quick detour to fetch the source code of a dependency directly from GitHub (by guessing the URL to the raw file) in order to confirm a detail, make the change, test the change with an ad-hoc "python -c ..." script, add a new automated test, run the tests and declare victory.

That's a different class entirely from what GPT-4o was able to do.
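
For anyone who hasn't watched one of these sessions, here's a rough sketch of the two ad-hoc tricks mentioned above. The repo path, file and symbol names below are illustrative guesses, not a transcript of what the agent actually ran:

    # Guess the raw-file URL for a dependency on GitHub
    # (pattern: https://raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>)
    # and fetch it to confirm a detail of its implementation.
    import urllib.request

    url = "https://raw.githubusercontent.com/psf/requests/main/src/requests/sessions.py"
    source = urllib.request.urlopen(url).read().decode("utf-8")
    print("def merge_setting" in source)  # does the dependency define this helper?

    # The throwaway check an agent runs via `python -c "..."` before
    # bothering with a proper automated test, for example:
    #   python -c "from requests.sessions import Session; print(Session().headers)"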

  • I think the thing people have to understand is how fast the value proposition is changing. There is a lot of conversation about "plateauing" model performance, but the actual experience, from the combination of model and tooling changes, is night and day compared with 3 months ago. It was beginning to be very useful with Claude 3.7 in the spring of this year, but we have just gone through a step-function change.

    I was decommissioning some code and I made the mistake of asking for an "exhaustive" analysis of the areas I needed to remove. Sonnet 4.5 took 30 minutes looking around and compiling a detailed report on exactly what needed to be removed from this very, very brownfield project, and after I reviewed the report it one-shotted the decommissioning of the code (in this case I was using Claude in the Cursor tooling at work). It was overkill, but impressive how well it mapped all the ramifications in the code base by grepping around.

  • Indeed, Codex CLI is quite useful even for demanding tasks. The current problem is that it might gather context for 20 minutes before doing the actual thing. The question is whether this will be sped up significantly.

  • I guess we just have to take your word for this, which is somewhat odd considering most of your comments link back to some artifact of yours. Are you paid by any of these companies?

    • OP is one of the co-creators of Django (for which I am eternally grateful, having built my first company on top of it) and one of the most prolific writers in the space. I also happen to strongly agree with his assessment, though as he said getting that amount of value out of current tools is real work.


    • That the tools do this kind of thing? They do; they’ll go through pretty long multi-step processes to find things and edit them. They run tests, check the output, see it’s wrong, go and add debug statements, rerun, try to fix things, rerun, then remove the logging.