
Comment by agentultra

7 days ago

I don’t think people are good at self-reporting the “boost” it gives them.

We need more empirical evidence. And historically we’re really bad at running such studies and they’re usually incredibly expensive. And the people with the money aren’t interested in engineering. They generally have other motives for allowing FUD and hype about productivity to spread.

Personally I don’t see these tools going much further than where they are now. They choke on anything that isn’t a greenfield project and consistently produce unwanted results. I don’t know what magic incantations and combinations of agents people have got set up, but if that’s what they call “engineering” these days, I’m not sure that word has any meaning anymore.

Maybe these tools will get there one day but don’t go holding your breath.

> They choke on anything that isn’t a greenfield project and consistently produce unwanted results.

That was true 8 months ago. It's not true today, because of the one-two punch of modern longer-context "reasoning" models (Claude 4+, GPT-5+) and terminal-based coding agents (Claude Code, Codex CLI).

Setting those loose on an existing large project is a very different experience from previous LLM tools.

I've watched Claude Code use grep to find potential candidates for a change I want to make, then read the related code, follow back the chain of function calls, track down the relevant tests, make a quick detour to fetch the source code of a dependency directly from GitHub (by guessing the URL to the raw file) in order to confirm a detail, make the change, test the change with an ad-hoc "python -c ..." script, add a new automated test, run the tests and declare victory.

That's a different class entirely from what GPT-4o was able to do.
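
For anyone who hasn't watched one of these sessions, here's a rough sketch of the two ad-hoc tricks mentioned above. The repo path, file and symbol names below are illustrative guesses, not a transcript of what the agent actually ran:

    # Guess the raw-file URL for a dependency on GitHub
    # (pattern: https://raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>)
    # and fetch it to confirm a detail of its implementation.
    import urllib.request

    url = "https://raw.githubusercontent.com/psf/requests/main/src/requests/sessions.py"
    source = urllib.request.urlopen(url).read().decode("utf-8")
    print("def merge_setting" in source)  # does the dependency define this helper?

    # The throwaway check an agent runs via `python -c "..."` before
    # bothering with a proper automated test, for example:
    #   python -c "from requests.sessions import Session; print(Session().headers)"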

  • I think the thing people have to understand is how fast the value proposition is changing. There is a lot of conversation about "plateauing" model performance, but the actual experience, from the combination of model and tooling changes, is night and day compared with 3 months ago. It was beginning to be very useful with Claude 3.7 in the spring of this year, but we have just gone through a step-function change.

    I was decommissioning some code and I made the mistake of asking for an "exhaustive" analysis of the areas I needed to remove. Sonnet 4.5 took 30 minutes looking around and compiling a detailed report on exactly what needed to be removed from this very, very brownfield project, and after I reviewed the report it one-shotted the decommissioning of the code (in this case I was using Claude in the Cursor tooling at work). It was overkill, but impressive how well it mapped all the ramifications in the code base by grepping around.

  • Indeed, Codex CLI is quite useful even for demanding tasks. The current problem is that it might gather context for 20 minutes before doing the actual thing. The question is whether this will be sped up significantly.

  • I guess we just have to take your word for this, which is somewhat odd considering most of your comments link back to some artifact of yours. Are you paid by any of these companies?

    • OP is one of the co-creators of Django (for which I am eternally grateful, having built my first company on top of it) and one of the most prolific writers in the space. I also happen to strongly agree with his assessment, though as he said getting that amount of value out of current tools is real work.


    • That the tools do this kind of thing? They do; they’ll go through pretty long multi-step processes to find things and edit them. They run tests, check the output, see it’s wrong, go and add debug statements, rerun, try to fix things, rerun, then remove the logging.