Comment by iloveoof
20 hours ago
Software engineering has always worked this way, just not to ICs.
“The LLMs produce non-deterministic output and generate code much faster than we can read it, so we can’t seriously expect to effectively review, understand, and approve every diff anymore. But that doesn’t necessarily mean we stop being rigorous, it could mean we should move rigor elsewhere.“
Direct reports, when delegated tasks by managers, product non-deterministic outputs much faster than team leads/managers can review, understand or approve every diff. Being a manager of software developers has always been a non-deterministic form of software engineering.
Simon Willison made a similar parallel recently:
https://simonwillison.net/2026/May/6/vibe-coding-and-agentic...
Suppose the image resize service has some caching, and due to a bug in the caching, under certain circumstances it will respond with an already-cached resized version of a different source image.
Let's say for example it caches on something stupid like the CRC32 of the input image -- good enough that the couple dozen images in your test dataset don't collide, you don't see it in smoke testing your app, but real world data has collisions on a daily basis.
This gets into production and customer A sees a resized version of customer B's document for a thumbnail. Now customer A is wondering how many other customers are seeing resized versions of their private documents in thumbnail images. They are very very mad.
If the image resize service was built by "another team" then that other team is responsible for the bug and will take most of the heat for it. If it was built by an "agent swarm" or "gas town" or whatever under my direction then I'm 100% responsible for it and rightly deserve the heat.
That is why I cannot understand any approach that doesn't involve reading the code at all. Testing alone is not sufficient. MTTR is not sufficient because you can't make a customer less mad about a data privacy bug by fixing it.
Practically, this is just about confidence values, anticipated blast radius and balancing testing vs review overhead.
This leaves out the part where you ask the original developer: "Why does this thing do that?"
But then, the ownership is clear. And no team would be like to be pointed that their 5th iteration is also broken and can’t be relied for production usage. That’s the difference with AI code. LLM are not aligned with your goals. Any trust in them doing the right thing is very misguided.
That's why you have them write tons of tests. Way more than you generally would for human written code. And the agent writing/maintaining the tests is not the agent fixing the bugs.
I've personally had a LLM write an image resizing library for me. It's a fairly basic one, I didn't need anything fancy. I could have used something off the shelf but it was at a time when I was testing what Claude could do. And to be honest, it just worked. One shot, if I recall correctly, or at least, one session with a few tweaks and never touched again. It's been embedded in a larger app for several months and I don't recall hitting a single bug with that, specifically. So I'm not sure your complaints about "the 5th iteration" being broken have much grounds here.
1 reply →
Simon Willison’s analogy does not apply unless that other team was immediately fired after they delivered the image resize service, or (more commonly) was done by a one off contractor. The difference is the trust model. We trust that our company has hired a competent team which maintains knowledge of the image resizing service, that they respond to bug reports and feature requests and that they know how to fix and implement those.
Now I have been on HN long enough to know that we used to despise code written by contractors which we now depend on.
Why does the team need to be "fired"?
The single person who did the service might just quit and go to another job. They might be external consultants that rotate away when the contract ends. It might be a SaaS service where you don't control the code at all - nor the composition of their team.
We have trusted services, contractors and teams within our companies before. Now suddenly _everyone_ has ALWAYS read and meticulously analyzed every single line of code they have ever imported to a project?
2 replies →
No, because those direct reports can use tools to build deterministic software. LLMs can't, because they themselves are non-deterministic. They will say they did, and they will be wrong. And the LLM you have check will also say it did, and it will be wrong. Etc etc.
These things just can't be in the critical path. They are ridiculously unreliable.
> Being a manager of software developers has always been a non-deterministic form of software engineering.
I disagree. Being a manager of programmers requires that you trust your programmers and have some way of occasionally verifying the correctness and efficacy of what they build to make sure that that trust is still properly placed.
But on top of that, the user of an LLM isn't really akin to a manager of programmers. Human programmers are responsible for what they write [0], and even the ones that only cost ~50% of a senior's total comp [1] are still going to be able to fairly reliably explain to you why they made the decisions they did, and fairly reliably be able to follow instruction. LLMs just aren't there yet, and the major LLM providers may never care to get them there.
A programmer who's using LLMs is a programmer who's using LLMs... not a manager of other programmers. I'm not going to say that the tech will never advance to that point, but it's simply not there yet.
[0] Unless management decides otherwise, of course.
[1] Nvidia's CEO recently mentioned that he'd be "deeply alarmed" if senior staff aren't spending at least half of their total compensation [2] on LLM providers, so I'm going to use that as my benchmark for "expected annual LLM spend".
[2] ...meaning that each senior programmer costs their employer at least 50% more than their total compensation....
> Being a manager of software developers has always been a non-deterministic form of software engineering.
Unless the manager is also a principal/architect, I don’t find this to be agreeable.
It’s similar to saying that you are a non-deterministic chef when you order food from a restaurant.
Well yes but if no humans at the company understand the code then no one is truly responsible for it.
what about the artifacts that were supposed to test the correctness of the code? are they passing willy nilly?
No amount of testing will save a large program with a dogshit architecture. Roughly, this is because tests increase coverage linearly with the number of tests, but weird interactions increase exponentially with code size.
This might be fine if you're building a tiny app, or if you're building a medium-sized app that follows a strict existing architecture (like a web app consisting mostly of forms). In which case, have fun.
But if you're building something slightly novel and interesting, then Claude is surprisingly bad at architecture and taste, and it tends to "fix" problems by spewing more slop. What you need instead is actual insight that leads to simplifying principles. This, in turn, allows breaking up the exponential complexity into disciplined patterns. This allows your code complexity to scale far more slowly, allowing an essentially linear number of tests to provide coverage.
I actually download and try people's vibe-coded developer tools. And frankly, those tools are some of the worst software I've used in my life, worse than even Unix-vendor Motif implementations from the early 90s.
Like, I'm super happy that people can vibe-code themselves simple, one-off personal tools. That's incredibly empowering. But that doesn't mean you can big, novel stuff the same way without a competent human actively in the loop.
3 replies →