Comment by CuriouslyC

20 days ago

Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc, so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.

There are higher- and lower-leverage ways to do that, for instance reviewing tests and QA'ing software through use rather than reading the original code, but you can't get away from doing it entirely.

I agree with this almost completely. The hard part isn’t generation anymore; it’s validating that the outcome matches the intent. Especially once decisions are high-stakes or irreversible, think package updates or large-scale transactions.

What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.

Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
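As a rough illustration (not my actual implementation; the reviewer functions, quorum threshold, and names here are invented for this sketch), the consensus-plus-escalation loop might look like:

```python
# Hypothetical sketch of consensus-based validation. The `reviewers` are
# stand-ins for independent agents with different incentives; a real system
# would call separate models and surface the disagreement trace.
from collections import Counter
from typing import Callable

Verdict = str  # "approve" | "reject"

def validate_with_consensus(
    artifact: str,
    reviewers: list[Callable[[str], Verdict]],
    quorum: float = 0.75,
) -> tuple[str, list[Verdict]]:
    """Run independent reviewers; escalate to a human when they disagree."""
    verdicts = [review(artifact) for review in reviewers]
    top_verdict, votes = Counter(verdicts).most_common(1)[0]
    if votes / len(verdicts) >= quorum:
        return top_verdict, verdicts          # machine consensus reached
    return "escalate-to-human", verdicts      # disagreement surfaced for a human

# Toy reviewers standing in for agents with different review policies.
strict = lambda a: "reject" if "TODO" in a else "approve"
lenient = lambda a: "approve"
paranoid = lambda a: "reject" if len(a) > 500 else "approve"

decision, votes = validate_with_consensus("def f(): return 1",
                                          [strict, lenient, paranoid])
```

The point isn't the voting math; it's that human attention is only spent on the cases where independent reviewers diverge.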

Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.

  • “Anymore?” After 40 years in software I’ll say that validation of intent vs. outcome has always been a hard problem. There are and have been no shortcuts other than determined human effort.

    • I don’t disagree. After decades, it’s still hard, which is exactly why I think treating validation as a system problem matters.

      We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.

    • Specification languages essentially need big investments, both technical and educational.

      Consider something like TLA+. How can we make something like that useful in an LLM orchestration framework while keeping it human-friendly? That'd be the question I ask.

      Then the developer verifies just the spec, and lets the LLM match the implementation against it far more rigorously than is possible now.
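A rough Python sketch of that division of labor (the `withdraw` example and its invariants are invented here; a real setup would use something like TLA+ or a property-based testing tool rather than this hand-rolled loop):

```python
# Illustrative sketch: the human reviews only the spec (the invariants),
# and the generated implementation is checked against it mechanically.
import random

# --- The spec the developer actually reviews: three invariants -----------
def spec_holds(balance_before: int, amount: int, balance_after: int) -> bool:
    return (
        balance_after == balance_before - amount  # conservation
        and balance_after >= 0                    # no overdraft
        and amount > 0                            # only positive withdrawals
    )

# --- Hypothetical LLM-generated implementation under test ----------------
def withdraw(balance: int, amount: int) -> int:
    if amount <= 0 or amount > balance:
        raise ValueError("invalid withdrawal")
    return balance - amount

# --- Mechanical check: random cases instead of line-by-line code review --
random.seed(0)
for _ in range(1_000):
    balance = random.randint(0, 100)
    amount = random.randint(1, 100)
    try:
        after = withdraw(balance, amount)
    except ValueError:
        continue  # rejected inputs are outside the spec's scope
    assert spec_holds(balance, amount, after)
```

The human never reads `withdraw`; they read (and argue about) `spec_holds`.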

But is that different from how we already work with humans? Typically we don't let people commit whatever code they want just because they're human. It's more than just code reviews. We have design reviews, sometimes people pair program, there are unit tests and end-to-end tests and all kinds of tests, then code review, continuous integration, QA. We have systems to watch prod for errors, user complaints, or cost/performance problems. We have this whole toolkit of processes and techniques to try to get reliable programs out of what you must admit are unreliable programmers.

The question isn't whether agentic coders are perfect. Actually it isn't even whether they're better than humans. It's whether they're a net positive contribution. If you turn them loose in that kind of system, surrounded by checks and balances, does the system tend to accumulate bugs or remove them? Does it converge on high or low quality?

I think the answer as of Opus 4.5 or so is that they're a slight net positive and it converges on quality. You can set up the system and kind of supervise from a distance and they keep things under control. They tend to do the right thing. I think that's what they're saying in this article.

AI also goes off the rails quickly, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. The worst part is that if you let it, it will just pile tech debt on top of tech debt.

  • The code itself does not matter. If the tests pass, and the tests are good, then who cares? AI will be maintaining the code.

    • Next iterations of models will have to deal with that code, and it will become harder and harder to fix bugs and introduce features without triggering or introducing more defects.

      Biological evolution overcomes this by running thousands and millions of variations in parallel and letting the more defective ones crash and die. In software ecosystems, we can't afford such a luxury.

    • An example: it had a complete interface to a hash map. The task was to delete elements. Instead of using the hash map API, it iterated through the entire underlying array to remove a single entry. The expected solution was O(1), but it implemented O(n). These decisions compound. The software may technically work, but the user experience suffers.
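A Python reconstruction of the kind of anti-pattern described (illustrative only, not the actual generated code):

```python
# What the model wrote: a linear scan over the entries, O(n).
def delete_entry_scan(table: dict, key: str) -> None:
    for k in list(table.keys()):          # walks every entry
        if k == key:
            del table[k]
            return

# What the hash map API already offered: O(1) average-case delete.
def delete_entry_direct(table: dict, key: str) -> None:
    table.pop(key, None)                  # single hashed lookup

t1 = {f"k{i}": i for i in range(5)}
t2 = dict(t1)
delete_entry_scan(t1, "k3")
delete_entry_direct(t2, "k3")
assert t1 == t2  # identical result, very different cost at scale
```

Both versions pass any behavioral test, which is exactly why these decisions compound unnoticed.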

      4 replies →

    • That's assuming no human will ever go near the code; that over time it doesn't get out of hand (inference time and token limits are real constraints); and that anti-patterns don't accumulate until the code is a logical mess that produces bugs through a web of specific behaviors instead of proper architecture.

      However, I guess at least some of that can be mitigated by distilling out a system description and then running agents again to refactor the entire thing.

      5 replies →

This obviously depends on what you're trying to achieve, but it's worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I suspect we are currently underutilizing them (historically they weren't very fun to write, but if everything is just tokens, then who cares).

And “define the spec concretely” (and how to exploit emerging behaviors) becomes the new definition of what programming is.

  • > “define the spec concretely”

    (and unambiguously, and completely, to various depths of those)

    This has always been the crux of programming. It has just been drowned in closer-to-the-machine, more-deterministic verbosities, be it assembly, C, Prolog, JS, Python, HTML, what-have-you.

    There have been never-ending attempts to reduce that to a more away-from-the-machine representation: low-code/no-code (anyone remember Last-one for the Apple ][?), interpreting and/or generating DSLs at various levels of abstraction, on through Esperanto-like artificial reduced-ambiguity languages... some even English-like.

    For some domains, the above worked and still works, and the (business) analysts became the new programmers. Some companies have such internal languages. For most others, not really. And not that long ago, the SW-engineer job title was analyst-programmer.

    But still, the frontier is there to cross..

    • Code is always the final spec. Maybe the "no engineers/coders/programmers" dream will come true, but in the end, the soft, wish-like, very under-detailed business "spec" has to be transformed into a hard implementation that covers all (well, most) corner cases. Maybe when context size reaches 1G tokens and memory isn't wiped every new session? Maybe after two or three breakthrough papers? For now, the frontier hasn't been reached.

      1 reply →

This is what we're working on at Speedscale. Our approach uses traffic capture and replay to validate that what worked before still works today.

Did you read the article?

>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).

  • Tests are only rigorous if the correct intent is encoded in them. Perfectly working software can still be wrong if the intent was inferred incorrectly. I leverage BDD heavily, and there are a lot of little details it's possible to misinterpret going from spec -> code. If the spec were sufficient to fully specify the program, it would be the program, so there's lots of room for error in the transformation.
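A tiny illustration of how intent leaks in the spec -> code step (the scenario and function names are hypothetical): the clause "orders over $100 ship free" encodes a strict >, but >= is an equally plausible reading, and tests written from the same misreading pass anyway.

```python
# Two implementations of the same BDD clause, "orders over $100 ship free".
FREE_SHIPPING_THRESHOLD = 100

def shipping_cost_strict(total: float) -> float:
    return 0.0 if total > FREE_SHIPPING_THRESHOLD else 5.0   # reads "over" as >

def shipping_cost_inclusive(total: float) -> float:
    return 0.0 if total >= FREE_SHIPPING_THRESHOLD else 5.0  # misreads it as >=

# The obvious tests agree, so both versions look "correct"...
assert shipping_cost_strict(150) == shipping_cost_inclusive(150) == 0.0
assert shipping_cost_strict(50) == shipping_cost_inclusive(50) == 5.0

# ...and only the boundary case exposes that the encoded intent diverged.
assert shipping_cost_strict(100) != shipping_cost_inclusive(100)
```

Multiply that one-token ambiguity across a real spec and the room for silent error is large.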

    • Then I disagree with you

      > You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.

      You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And in my experience, it is very trustworthy in these respects.

      Can you detail a scenario by which an LLM can get the scenario wrong?

      15 replies →

    • > If the spec was sufficient to fully specify the program, it would be the program

      A very salient concept with regard to LLMs and the idea that one can encode a program one wishes to see output as natural English-language input. There's lots of room for error in all of these LLM transformations for the same reason.