It isn't. Anthropic tried building a fairly simple piece of software (a C compiler) with a full spec, thousands of human-written tests, and a reference implementation - all of which were made available to the agent and the model trained on. It's hard to imagine a better tested, better-specified project, and we're talking about 20KLOC. Their agents worked for two weeks and produced a 100KLOC codebase that was unsalvageable - any fix to one thing broke another [1]. Again, their attempt was to write software that's smaller, better tested, and better specified than virtually any piece of real software and the agents still failed.
Today's agents are simply not capable enough to write evolvable software without close supervision to save them from the catastrophic mistakes they make on their own with alarming frequency.
Specifically, if you look at agent-generated code, it is typically highly defensive, even against bugs in its own code. It establishes an invariant and then writes a contingency in case the invariant doesn't hold. I once asked it to maintain some data structure so that it could avoid a costly loop. It did, but in the same round it added a contingency (that uses the expensive loop) in the code that consumes the data structure in case it maintained it incorrectly.
This makes it very hard for both humans and the agent to find later bugs and know what the invariants are. How do you test for that? You may think you can spec against that, but you can't, because these are code-level invariants, not behavioural invariants. The best you can do is ask the agent to document every code-level invariant it establishes and rely on it. That can work for a while, but after some time there's just too much, and the agent starts ignoring the instructions.
I think that people who believe that agents produce fine-but-messy code without close supervision either don't carefully review the code or abandon the project before it collapses. There's no way people who use agents a lot and supervise them closely believe they can just work on their own.
Lol I largely agree with my beloved dissenters, just not on the same magnitude. I understand complete specs are impossible and equivalent to source code via declaration. My disagreement is with this particular part:
"t's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. "
If your test/design of a BUILDING doesn't include at simulations/approximations of such easy to catch structural flaws, its just bad engineering. Which rhymes a lot with the people that hate AI. By and large, they just don't use it well.
"Incomplete specs" is the way of the world. Even highly engineered projects like buildings have "incomplete specs" because the world is unpredictable and you simply cannot anticipate everything that might come up.
And sometimes it can't even handle it then. I was recently porting ruby web code to python. Agents were simultaneously surprisingly good (converting ActiveRecord to sqlalchemy ORM) and shockingly, incapably bad.
For example, ruby uses blocks a lot. Ruby blocks are curious little thingies because they are arguably just syntax sugar for a HOF, but man it's great syntax sugar. Python then has "yield" which is simultaneously the same keyword ruby uses for blocks, but works fundamentally differently (instead of just a HOF, it's for generating an iterator/generator) and while there are some decorators that can use yield's ability to "pause" execution in the function to send control flow back out of the function for a moment (@contextmanager) which feels _even more_ like ruby blocks, it's a rather limited trick and requires the decorator to adapt the Generator to a context manager and there's just no good way to generalize that.
Somehow this is the perfect storm to make LLMs completely incapable of converting ruby code that uses blocks for more than the basic iteration used in the stdlib. It will try to port to python code that is either nonsensical, or uses yield incorrectly and doesn't actually work (and in a way that type checkers can even spot). And furthermore, even if you can technically whack it with a hammer until it works with yield, it's often not at all the way to do it. Ruby devs use blocks not-uncommonly while python devs are not really going to be using yield often at all, perhaps outside of @contextmanager. So the right move is usually to just restructure control flow to not need to use blocks/HOFs (or double down and explicitly pass in a function). (Rubyists will cringe at this, and rightly so... Ruby is often extraordinarily expressive).
The fact that such a simple language feature trips them up so completely is pretty odd to me. I guess maybe their training data doesn't include a lot of ruby-to-python conversions. Maybe that's indicative of something, but I digress.
It isn't. Anthropic tried building a fairly simple piece of software (a C compiler) with a full spec, thousands of human-written tests, and a reference implementation - all of which were made available to the agent and the model trained on. It's hard to imagine a better tested, better-specified project, and we're talking about 20KLOC. Their agents worked for two weeks and produced a 100KLOC codebase that was unsalvageable - any fix to one thing broke another [1]. Again, their attempt was to write software that's smaller, better tested, and better specified than virtually any piece of real software and the agents still failed.
Today's agents are simply not capable enough to write evolvable software without close supervision to save them from the catastrophic mistakes they make on their own with alarming frequency.
Specifically, if you look at agent-generated code, it is typically highly defensive, even against bugs in its own code. It establishes an invariant and then writes a contingency in case the invariant doesn't hold. I once asked it to maintain some data structure so that it could avoid a costly loop. It did, but in the same round it added a contingency (that uses the expensive loop) in the code that consumes the data structure in case it maintained it incorrectly.
This makes it very hard for both humans and the agent to find later bugs and know what the invariants are. How do you test for that? You may think you can spec against that, but you can't, because these are code-level invariants, not behavioural invariants. The best you can do is ask the agent to document every code-level invariant it establishes and rely on it. That can work for a while, but after some time there's just too much, and the agent starts ignoring the instructions.
I think that people who believe that agents produce fine-but-messy code without close supervision either don't carefully review the code or abandon the project before it collapses. There's no way people who use agents a lot and supervise them closely believe they can just work on their own.
[1]: https://www.anthropic.com/engineering/building-c-compiler
Lol I largely agree with my beloved dissenters, just not on the same magnitude. I understand complete specs are impossible and equivalent to source code via declaration. My disagreement is with this particular part:
"t's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. "
If your test/design of a BUILDING doesn't include at simulations/approximations of such easy to catch structural flaws, its just bad engineering. Which rhymes a lot with the people that hate AI. By and large, they just don't use it well.
"Incomplete specs" is the way of the world. Even highly engineered projects like buildings have "incomplete specs" because the world is unpredictable and you simply cannot anticipate everything that might come up.
A sufficiently complete spec is indistinguishable from source code.
And sometimes it can't even handle it then. I was recently porting ruby web code to python. Agents were simultaneously surprisingly good (converting ActiveRecord to sqlalchemy ORM) and shockingly, incapably bad.
For example, ruby uses blocks a lot. Ruby blocks are curious little thingies because they are arguably just syntax sugar for a HOF, but man it's great syntax sugar. Python then has "yield" which is simultaneously the same keyword ruby uses for blocks, but works fundamentally differently (instead of just a HOF, it's for generating an iterator/generator) and while there are some decorators that can use yield's ability to "pause" execution in the function to send control flow back out of the function for a moment (@contextmanager) which feels _even more_ like ruby blocks, it's a rather limited trick and requires the decorator to adapt the Generator to a context manager and there's just no good way to generalize that.
Somehow this is the perfect storm to make LLMs completely incapable of converting ruby code that uses blocks for more than the basic iteration used in the stdlib. It will try to port to python code that is either nonsensical, or uses yield incorrectly and doesn't actually work (and in a way that type checkers can even spot). And furthermore, even if you can technically whack it with a hammer until it works with yield, it's often not at all the way to do it. Ruby devs use blocks not-uncommonly while python devs are not really going to be using yield often at all, perhaps outside of @contextmanager. So the right move is usually to just restructure control flow to not need to use blocks/HOFs (or double down and explicitly pass in a function). (Rubyists will cringe at this, and rightly so... Ruby is often extraordinarily expressive).
The fact that such a simple language feature trips them up so completely is pretty odd to me. I guess maybe their training data doesn't include a lot of ruby-to-python conversions. Maybe that's indicative of something, but I digress.
We call the complete specs "source code".
... and it still doesn't work. In the Anthropic experiement, the model was trained on a reference implementation and the agents still failed.