Comment by simonw
13 hours ago
This is genuinely one of the most interesting questions right now. I don't have solid answers yet, and I'm very keen to learn what people are finding works.
If you accelerate the pace of code creation it inevitably creates bottlenecks elsewhere. Code review is by far the biggest of those right now.
There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?
But I don't want to lower my standards! I want the code I'm producing with coding agents to be better than the code I would produce without them.
There are some aspects of code review that you cannot skimp on. Things like coding standards may not matter as much, but security review will never be optional.
I've recently been wondering what we can learn from security teams at large companies. Once you have dozens or hundreds of teams shipping features at the same time - teams with varying levels of experience - you can no longer trust those teams not to make mistakes. I expect that the same strategies used by security teams at Facebook/Google-scale organizations could now be relevant to smaller organizations where coding agents are responsible for increasing amounts of code.
Generally though I think this is very much an unsolved problem. I hope to document the effective patterns for this as they emerge.
I think Martin Fowler's "Refactoring" might give a bit of insight here. One of my take-aways after reading that book is that the specific implementation of a function is not very important if you have tests. He argues that it can sometimes be easier to completely re-write a function than to take the time to understand it - as long as you can validate that your re-write performs the same way. This mindset lines up pretty closely with how I've been using LLMs.
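A minimal Python sketch of that "rewrite it, then check it behaves the same" idea. Everything here is hypothetical and only for illustration: `legacy_slugify` stands in for a function nobody wants to read closely, `rewritten_slugify` for a fresh rewrite (by hand or by an agent), and `first_mismatch` and `random_titles` are made-up helper names. The inputs are deliberately restricted to ASCII, since that is the only range where these two versions are meant to agree.

```python
import random
import re


def legacy_slugify(title: str) -> str:
    """Hypothetical legacy implementation that we would rather not study."""
    out = []
    prev_dash = True  # start as True so leading separators are dropped
    for ch in title.lower():
        if ch.isalnum():
            out.append(ch)
            prev_dash = False
        elif not prev_dash:
            out.append("-")
            prev_dash = True
    return "".join(out).rstrip("-")


def rewritten_slugify(title: str) -> str:
    """Hypothetical rewrite: same intent, different implementation."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def random_titles(count: int, seed: int = 0) -> list[str]:
    """Generate reproducible ASCII-only inputs (an assumption of this sketch)."""
    rng = random.Random(seed)
    alphabet = "abcXYZ019 -_!"
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        for _ in range(count)
    ]


def first_mismatch(old, new, inputs):
    """Return the first input where the two versions disagree, else None."""
    for value in inputs:
        if old(value) != new(value):
            return value
    return None


if __name__ == "__main__":
    print(first_mismatch(legacy_slugify, rewritten_slugify, random_titles(500)))
```

The check only says the two versions agree on the inputs you generated, so the input generator deserves as much review as the rewrite itself.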
If that's true, then I would think the emphasis in code review should be more on test quality and verifying that the spec is captured accurately, and as you suggest, the actual implementation is less important.
This is why I've been pushing back on the "just have the AI generate the tests!" mentality. Sure, let it help you, but those tests are the guarantee of quality and fit for purpose. If you vibe code them, how the hell do you know if it even does what you think it does?
You should be planning out the tests to properly exercise the spec, and ensuring those tests actually do what the spec requires. AI can suggest more tests (but be careful here, too, because a ballooned test suite slows down CI/CD), but it should never be in charge of them completely.
A related book I've been thinking about in terms of LLMs is "Working Effectively With Legacy Code". I'd love to be able to work a lot of that advice into some kind of Skill or customized agent to help with big refactors.
Oh gosh - now that you mention it, it was "Working Effectively with Legacy Code" that I was thinking of, not "Refactoring".
That's my experience with agentic development so far: a lot of extra time goes into testing.
Problem is, the way I've been trained to test isn't exactly antagonistic. QA does that kind of thing. Programmers writing tests are generally doing spot checks instead, and those only make sense if the code is broadly understood and trustworthy. The code LLMs produce is usually broken in subtle, hard-to-spot ways.
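To make the spot check versus antagonistic distinction concrete, here is a small Python sketch. It is entirely hypothetical: `chunk` is an invented helper with a deliberately planted bug, `chunk_fixed` is a corrected version, and `find_counterexample` is an illustrative name rather than any library's API. The spot check passes on the buggy version, while a randomized check against an invariant (flattening the chunks must give back the input) finds a failing case.

```python
import random


def chunk(items, size):
    """Hypothetical buggy helper: silently drops a trailing partial chunk."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


def chunk_fixed(items, size):
    """Corrected version: keeps the trailing partial chunk."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def find_counterexample(fn, trials=200, seed=0):
    """Try random inputs; return (items, size) that breaks the invariant, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(0, 9) for _ in range(rng.randint(0, 10))]
        size = rng.randint(1, 4)
        flattened = [x for part in fn(items, size) for x in part]
        if flattened != items:
            return items, size
    return None


if __name__ == "__main__":
    # The kind of spot check a trusting programmer writes: it passes.
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]
    # The antagonistic check: it reports an input the buggy version mishandles.
    print(find_counterexample(chunk))
    print(find_counterexample(chunk_fixed))
```

The spot check encodes what the author already believes about the code, while the randomized check encodes a property of the spec and goes looking for inputs nobody thought of.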
Counterpoint: developers who get used to not caring about function implementation are also, culturally, going to care less about test implementation, which makes this proposed ideal impossible.
With LLMs, tests cost almost no effort but provide tremendous value.
> There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?
Agree with everything else you said except this. In my opinion, this assumes code becomes more like a consumable as code-production costs reduce. But I don't think that's the case. Incorrect, but not visibly incorrect, code will sit in place for years.
> Agree with everything else you said except this.
Yeah, I'm not sure I agree with what I said there myself!
> Incorrect, but not visibly incorrect, code will sit in place for years.
If you let incorrect code sit in place for years I think that suggests a gap in your wider process somewhere.
I'm still trying to figure out what closing those gaps looks like.
The StrongDM pattern is interesting - having an ongoing swarm of testing agents which hammer away at a staging cluster trying different things and noting stuff that breaks. Effectively an agent-driven QA team.
I'm not going to add that to the guide until I've heard it working for other teams and experienced it myself though!
This kinda gets into the idea of AIs as droids right?
So, you have a code writing droid that is aligned towards writing good clean code that humans can read. Then you have an implementation droid that goes into actually launching and running the code and is aligned with business needs and expenses. And you have a QA droid that stress tests the code and is aligned with the hacker mindset and is just slightly evil, so to speak.
The droids work together to make good code, but are also independent and adversarial in the day to day.
It assumes that bugs are rare and easy to fix. A look at Claude Code's issue tracker (https://github.com/anthropics/claude-code/issues) tells you that this is not so. Your product could be perpetually broken, lurching from one vibe coded bug to another.
> When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?
I don't care how cheap it is to replace the incorrect code when it's modifying my bank account or keeping my lights on.
Oh, don't worry, even before AI the companies in question were already outsourcing a lot of this to the cheapest companies they could find. We are just very very lucky most of the problems incurred get caught before being foisted on the wider world.
One model I've seen is moving the review stage to the designs, not the code itself.
I.e. have a `planning/designs/unbuilt/...` folder that contains markdown descriptions of features/changes that would have gotten a PR. Now do the review at the design level.
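A rough Python sketch of how such a folder could be turned into a review queue. The `planning/designs/unbuilt` path comes from the comment above. Everything else is an assumption of mine: the function name `unbuilt_designs`, the idea that each design is a `.md` file, and the convention that its first line is the title.

```python
from pathlib import Path


def unbuilt_designs(repo_root):
    """List markdown design docs under planning/designs/unbuilt.

    Returns (relative_path, title) pairs sorted by path, where the title is
    assumed to be the document's first line with leading '#' marks removed.
    """
    folder = Path(repo_root) / "planning" / "designs" / "unbuilt"
    queue = []
    for doc in sorted(folder.rglob("*.md")):
        lines = doc.read_text(encoding="utf-8").splitlines()
        title = lines[0].lstrip("# ").strip() if lines else "(empty)"
        queue.append((doc.relative_to(folder).as_posix(), title))
    return queue


if __name__ == "__main__":
    for path, title in unbuilt_designs("."):
        print(f"{path}: {title}")
```

Reviewers would then read and approve the markdown in that queue, and the code generated from an approved design would get lighter scrutiny.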