Comment by pron

3 hours ago

> A messy codebase is still cheaper to send ten agents through than to staff a team around

People who say that haven't used today's agents enough or haven't looked closely at what they produce. The code they write isn't messy at all. It's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. The entire construction is just wrong, hidden behind a nice exterior. And when you need to add a couple more floors, the agents can't "get through it" and neither can people. The codebase is bricked.

Today's agents are simply not capable enough - without very close and labour-intensive human supervision - to produce code that can last through evolution over any substantial period of time.

Debugging would suffer as well, I assume. There's this old adage that if you write the cleverest code you can, you won't be clever enough to debug it.

There's nothing really stopping agents from writing the cleverest code they can. So my question is, when production goes down, who's debugging it? You don't have 10 days.

Something is missing in the common test suite if this can occur, right?

  • You can spend a lot of time perfecting the test suite to meet your specific requirements and needs, but I think that would take quite a while, and at that point, why not just write the code yourself? I think the most viable approach of today's AI is still to let it code and steer it when it makes a decision you don't like, as it goes along.

  • You have to fight to get agents to write tests, in my experience. It can be done, but by default they don't. I've yet to figure out how to get any agent to use TDD - that is, write a test and then verify it fails. Once in a while I can get it to write one test that way, but it then writes far more code to make it pass than the test justifies, and so it still misses coverage of important edge cases.

    • I have a TDD flow working as part of my task structuring and task completion. There are separate tasks for writing the tests and for implementing. The agent that implements is told to pick up only the first available task, which will be the “write tests” task, and it reliably does so. I just needed to add instructions for how it should mark tests as skipped, because that had been conflicting with quality gates.

  • First, it's not "can occur" but does occur 100% of the time. Second, sure, it does mean something is missing, but how do you test for "this codebase can withstand at least two years of evolution"?
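For what it's worth, the red/green discipline described above - write a test, watch it fail, then implement only enough to make it pass - can be sketched in a few lines of plain Python. The `slugify` function and its spec here are hypothetical illustrations, not something from this thread:

```python
import re

# Step 1 (red): write the test first. Running it before slugify
# exists fails with a NameError - that failure is the point.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("Hi, there!") == "hi-there"

# Step 2 (green): implement only enough to make the test pass -
# no extra features the test doesn't justify.
def slugify(text: str) -> str:
    # Drop punctuation, lowercase, and join the words with hyphens.
    words = re.sub(r"[^\w\s]", "", text).lower().split()
    return "-".join(words)

test_slugify()  # passes silently once the implementation is in
```

The complaint in the thread is that agents invert this: they write the implementation first (or far beyond what the test demands), so the test never plays its gatekeeping role.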

The problem is, the MBAs running the ship are convinced AI will solve all that with more datacenters. The fact that they talk about gigawatts of compute tells you how delusional they are. Further, the collateral damage this delusion will cause as these models sigmoid their way into agents, harnesses, expert models, fine-tuned derivatives, and cascading manifold intelligent word-salad exercises shouldn't be underestimated.

A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec. Whether it be humans or agents, people rarely specify that one explicitly but treat it as an assumed bit of knowledge.

It goes the other way quite often with people. How often do you see K8s for small projects?

  • > A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec

    I wish it could, but in practice, today's agents just can't do that. About once a week I reach some architectural bifurcation where one path is stable and the other leads to an inevitable total-loss catastrophe from which the codebase will not recover. The agent's success rate (I mostly use Codex with gpt5.4) is about 50-50. No matter what you explain to them, they just make catastrophic mistakes far too often.