Comment by fphilipe

2 days ago

I wonder the same. The answer I usually get from people who do manage is that they don't look at the code – or at least not in detail.

Personally, I always end up tweaking something the agent produced. I wonder if I should let go of that control...

InsideOutSanta 2 days ago

Even the newest models, like GPT 5.5, only deliver what I want nine out of ten times. If I didn't catch the remaining 10% of misguided garbage by manually reviewing every change, it would add up really quickly.

debabrata_saha 2 days ago

yeah

stavros 2 days ago

I never look at code. It used to be that it quickly became unmaintainable spaghetti where the agent struggled to make any change at all, but in the past year (and with a three-step plan/develop/review workflow), the quality is so good that I basically just don't look at the code any more.

It definitely has fewer bugs than a senior developer, but it really hinges on getting the plan right. 20 minutes of planning and 20 of implementation sounds about right for my workflow as well, just make sure you have GPT as a reviewer. It's very nitpicky and finds lots of bugs.

bogdanoff_2 20 hours ago
I'm starting to agree with you; I found the plan/develop/review workflow to work quite well, but I'm not at the point of not looking at the code at all yet.
I guess you actually review and actively participate in making the plan, you just don't review the code afterwards?
Could you share some more details on the specifics of your workflow? (What models/harnesses? Do you use the same or different context windows? How exactly do you run the review, and how do you pass along and act upon the information from the review?) Also, how big are the changes you usually implement with one plan/develop/review cycle?
- stavros 19 hours ago
  
  Sure! Here: https://www.stavros.io/posts/how-i-write-software-with-llms/
  The changes aren't usually very big, basically what you'd put in one ticket. If I need to make large changes, I do them in self-contained stages, if that's possible, otherwise I will tell the LLM to add specific tests in the plan, and I will test thoroughly after.
jappgar 1 day ago
20 minutes planning, 20 minutes coding, 200 minutes review and refactor (includes going for a walk and thinking about the problem deeply).
I know a lot of engineers who skip the last part. They're overconfident in their original plan. They're overconfident that the agent actually fulfilled the plan.
- stavros 1 day ago
  
  You aren't treating this as a question of ROI. Is it worth spending 5x as much to make sure the plan was OK and implemented well? Or is it actually OK if we discover the bug during testing?
  The answer won't be the same for all software, but you're assuming it will be.
materielle 2 days ago
This brings to mind two thoughts:
First, that this is challenging to scale across large orgs. Even if your plans produce high-quality code, that isn’t true for everyone. I’m definitely struggling with slop code being collectively mailed to me for review by our 1,000 engineers who were told to use their AI subscription all at once.
I feel like we should be taking “prompt engineering” more seriously. And when people mail me code to review, it should also include the agentic workflow and plan. That way, when code isn’t up to quality, we can have a discussion about the prompts used to generate it.
My second thought is related to your senior engineer comment. This isn’t surprising, because in most engineering orgs, seniority is completely unrelated to code quality. In fact, many orgs incentivize the opposite: “senior” devs that push out buggy code quickly and push accountability downhill to the junior devs.
- jappgar 1 day ago
  
  I'm so curious to see how other people prompt but literally no one I work with will share it. They might share plans, but they never show the conversation, which is the most crucial part.
  Judging by how they struggle to communicate generally, I can't imagine their prompts are doing much heavy lifting.
- stavros 2 days ago
  
  Eh, everything is challenging to scale across large orgs. Even before LLMs, the code was a huge ball of spaghetti that barely held together. Now we just get there faster.
  About senior engineers, I guess that depends on the org you have experience with. My experience doesn't match yours.