Comment by SkyPuncher

6 hours ago

I've noticed this as well. I had some time off in late January/early February. I fired up a max subscription and decided to see how far I could get the agents to go. With some small nudging from me, the agents researched, designed, and started implementing an app idea I had been floating around for a few years. I had intentionally not given them much to work with, but simply guided them on the problem space and my constraints (agent built, low capital, etc.). They came up with an extremely compelling app. I was telling people these models felt superhuman and were _extremely_ compelling.

A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.

I know this is anecdotal, but this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

There is a huge difference between greenfield development and working with an existing codebase.

I'm not trying to discredit your experience and maybe it really is something wrong with the model.

But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.

Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.

  • This has been my (admittedly limited) experience as well. LLMs are great at initial bring-up, good at finding bugs, bad at adding features.

    But I'm optimistic that this will gradually improve in time.

    • The only regularity I can discern in contemporary online debates about LLMs is that for every viewpoint expressed, with probability one someone else will write in with the diametrically opposite experience.

      Today it’s my turn to be that person. Large scientific code base with a bunch of nontrivial, handwritten modules accomplishing tasks that are distinct but structurally similar in terms of the underlying computation. Pointed GPT Pro at it, told it what new functionality I wanted, and it churned away for 40 minutes and completely knocked it out of the park. Estimated time savings of about 3-4 weeks. I’ve done this half a dozen times over the past two months and haven’t noticed any drop-off or degradation. If anything it got even better with 5.4.

    • I’ve had a good, contrasting experience with my side project (adashape.com), where most of the codebase is now written by Claude / Codex.

      The codebase itself is architected and documented to be LLM-friendly, and claude.md gives it a very strong harness for how to do things.

      As an architect, Claude is abysmal, but when you give it an existing software pattern that it merely needs to extend, it’s so good that it still gives me something like a 5x boost in feature velocity.

      Plus, when doing large refactorings, it forgets far fewer things than I do.

      Inventing new architecture is as hard as ever and it’s not much help there, unless you can point it to some well-documented pattern and tell it "do it like that please".

  • This isn't the case. I basically did an entire business/project/product exploration before building the first feature.

    Even after deleting everything from the first feature and going back to the checkpoint just before initial development, I can no longer get it to accomplish anything meaningful without my direct guidance.

Same experience here. I was working on an easily testable problem and there was one simple task left. In January I was able to create 90% of the project with Claude; now I cannot get it to pass the last 10%, which is just a few enums and some match. Codex was able to do it easily.

> A month later, I literally cannot get them to iterate or improve on it.

Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.

Brownfield? Not so much.