Comment by Zakodiac
19 days ago
The Digital Twin Universe is the most interesting thing in this article and the part most people are glossing over. The real question Simon nails is: how do you prove software works when both the implementation and the tests are written by agents? Because agents will absolutely game your test suite - return true, rewrite assertions to match broken output, whatever gets them to green.
Their answer of keeping scenarios external to the codebase, like a holdout set, is smart. And building full behavioral clones of services like Okta, Jira, and Slack so you can run thousands of end-to-end scenarios without hitting rate limits or production - that's where the actual hard engineering work is. Not the code generation; the validation infrastructure.
Most teams trying this will skip that part because it's expensive and unglamorous. They'll let agents write code and tests together and wonder why things break in production. The "factory" part isn't the agents writing code. It's having robust enough external proof that the code does what it's supposed to.
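To make the holdout idea concrete, here's a rough sketch (names and file layout are hypothetical, not anything from the article): scenarios live somewhere the coding agents have no write access to, and only a runner they can't edit loads and executes them.

```python
import json

# Hypothetical holdout scenario, normally loaded from a store the
# coding agents cannot write to (e.g. a separate repo or bucket).
SCENARIO = json.loads("""
{
  "name": "create_ticket_roundtrip",
  "call": "create_ticket",
  "args": {"title": "VPN down"},
  "expect": {"status": "open", "title": "VPN down"}
}
""")

def run_scenario(system, scenario):
    """Execute one externally defined scenario against the system under test.

    The agents never see `scenario`, so they can't rewrite its
    expectations to match broken output."""
    result = getattr(system, scenario["call"])(**scenario["args"])
    # Collect every expected field the implementation got wrong.
    failures = {k: (v, result.get(k))
                for k, v in scenario["expect"].items()
                if result.get(k) != v}
    return failures  # empty dict == pass

# Stand-in for the agent-written implementation under test.
class TicketSystem:
    def create_ticket(self, title):
        return {"status": "open", "title": title}

print(run_scenario(TicketSystem(), SCENARIO))  # {} -> scenario passed
```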
(DTU creator here)
I did have an initial key insight that led to a repeatable strategy for ensuring high fidelity between the DTU and the official canonical SaaS services:
Use the most popular publicly available reference SDK client libraries as compatibility targets, with the goal always being 100% compatibility.
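As a rough illustration of what "compatibility target" means in practice (a hypothetical harness, not the actual DTU tooling): replay the same call against the canonical service and the twin, then diff the responses field by field. In a real setup the client would be an official SDK pointed at the twin's base URL.

```python
def diff_responses(canonical: dict, twin: dict, path: str = "") -> list:
    """Return a list of paths where the twin's response diverges
    from the canonical service's response."""
    diffs = []
    for key in canonical.keys() | twin.keys():
        here = f"{path}.{key}" if path else key
        if key not in twin:
            diffs.append(f"missing field: {here}")
        elif key not in canonical:
            diffs.append(f"extra field: {here}")
        elif isinstance(canonical[key], dict) and isinstance(twin[key], dict):
            # Recurse into nested objects so mismatches get a full path.
            diffs.extend(diff_responses(canonical[key], twin[key], here))
        elif canonical[key] != twin[key]:
            diffs.append(f"value mismatch: {here}")
    return diffs

# Illustrative payloads, loosely Slack-shaped.
canonical = {"ok": True, "channel": {"id": "C123", "is_archived": False}}
twin      = {"ok": True, "channel": {"id": "C123"}}
print(diff_responses(canonical, twin))  # ['missing field: channel.is_archived']
```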
You've also zeroed in on how challenging this was: I started this back in August 2025 (as one of many projects; at any time we're each juggling 3-8 projects) with only Sonnet 3.5. Much of the work was still very unglamorous, but feasible. Especially Slack; in some ways Slack was more challenging to get right than all of G-Suite (!).
Now I'm part way through reimplementing the entire DTU in Rust (v1 was in Go) and with gpt-5.2 for planning and gpt-5.3-codex for execution it's significantly less human effort.
IMO the most novel part of this story is Navan's Attractor and corresponding NLSpec. Feed in a good Definition-of-Done and it'll bounce around between nodes until it gets it right. There were already several working implementations within 24 hours of its release, one of which is even open source [0].
[0] https://github.com/danshapiro/kilroy
Been toying around with DTs myself for a few months. Until December, LLMs couldn't correctly hold large amounts of modeled behavior internally.
Why the switch from Go to Rust?
I'm testing a theory that large-scale (LoC) generated projects in Rust tend to have fewer functional bugs compared to e.g. Go or Java because Rust as a language is a little stricter.
I've not yet formed a full opinion or conclusion, but in general I'm starting to prefer Rust.
Re: generalizing mocks - it sounds interesting, but after getting full-fidelity clones of so many multi-billion-dollar SaaS offerings, I'm hooked on this approach. It pays nice dividends when developing with agentic coders at high scale. In a few more model releases, having your own exhaustive DTU could become trivial.
Are the digital twins open source anywhere, or available as a service somehow? They sound useful!
> The Go to Rust rewrite is interesting - was that driven by performance or more about the ecosystem/tooling for this kind of work?
I'm testing a theory that large-scale (LoC) generated projects in Rust tend to have fewer functional bugs compared to e.g. Go or Java because Rust as a language is a little stricter.
I've not yet formed a full opinion or conclusion, but in general I'm starting to prefer Rust.
Re: generalizing mocks - it sounds interesting, but after getting full-fidelity clones of so many multi-billion-dollar SaaS offerings, I'm hooked on this approach. It pays nice dividends when developing with agentic coders at high scale. In a few more model releases, having your own exhaustive DTU could become trivial.
Am I growing too paranoid, or are you using AI to generate the comments posted on this account?
At first I was partially impressed by the Digital Twin Universe thing they describe. Having worked with 3rd-party APIs in a previous life, I know something like that would've been a huge help.
But after thinking about it more, I think it must be the lowest of low-hanging fruit for LLMs. You're building something with well-defined specs, most of which are readily available from the original creators, with a UI that only does the bare minimum, and it doesn't need any long-term properties like reliability since it's all for internal, short-lived use. On top of that, it looks super impressive in a demo, because the applications being mocked are very complicated pieces of software, so recreating even a thin facade of them can look very impressive. And calling it a "Digital Twin Universe" is just icing on the cake.
It suggests that we will move towards an "everything must have an API" world.
But at some point you get back to tests, because they are simpler to write.
This is a child of the "no handwritten code" rule. Since they can't hand-steer the tests, they have to do something else to ensure quality.
This is only worth it if the added cost and overhead is cheaper than writing the code.
This seems like it will pull you towards building a simulation of your firm, or of your processes, for the approach to work?
Strongly inclined to agree here: I recently joined a small applied AI startup and we were discussing the need for E2E tests. My initial gut reaction (which I kept quiet) was that such things turn into unmaintainable messes which delay releases and deliver diminishing value.
I recognised this was grounded in an entirely different world of software engineering and organisation size, though. I followed a path of thinking about what went wrong historically and how those problems might be solved: better structure, discipline, resources - all things which agentic AI facilitates.
You are right that most will skip this part, but I view it as being like a sewerage and sanitation system - largely invisible and rarely thought about, but critical for long-term health.
Also, this ties in very nicely with Netflix's approach to Chaos Engineering, enabling it at broader scale.
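In that spirit, a twin makes fault injection cheap: wrap its handlers and inject the failures the real service only shows you at the worst possible moments. A hypothetical sketch (not DTU code):

```python
import random

class ChaosProxy:
    """Wrap a twin's API handler and inject faults (rate limits,
    outages) at a configurable rate - hypothetical sketch, not DTU code."""

    def __init__(self, handler, failure_rate=0.2, rng=None):
        self.handler = handler
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def call(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            # Mimic the error shapes real SaaS APIs return under stress.
            return {"ok": False, "error": self.rng.choice(
                ["ratelimited", "service_unavailable", "timeout"])}
        return self.handler(*args, **kwargs)

# Stand-in for a twin's endpoint handler.
def post_message(text):
    return {"ok": True, "text": text}

# Seeded RNG keeps a given chaos run reproducible for debugging.
proxy = ChaosProxy(post_message, failure_rate=0.3, rng=random.Random(42))
results = [proxy.call("hello") for _ in range(10)]
print(sum(1 for r in results if not r["ok"]), "injected failures out of 10")
```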
> You are right that most will skip this part, but I view it as being like a sewerage and sanitation system - largely invisible and rarely thought about, but critical for long-term health.
And, like sewage and sanitation, the infrastructure is a lot more complicated than people think.
I’m curious what happens when they need to make a DTU of Stripe or another payment processor.
You have a different agent write the tests and another run the tests. You tell them each that they aren’t checking their own work, they’re checking someone else’s. You can tell them to be skeptical. Then you can also tell them not to fail the code for no reason, because a third agent will check their tests and penalize inaccurate testing.
This approach balances incentives and maximizes accuracy.
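A sketch of how that separation of duties might be wired up, with a stub standing in for the real model calls (prompts and roles here are illustrative, not from the article):

```python
# Writer / tester / auditor separation. `call_agent` is a stub; a real
# system would hit a model API with these role prompts.

TESTER_PROMPT = (
    "You are reviewing SOMEONE ELSE'S implementation. Be skeptical: "
    "write tests that try to break it. Do not fail code without cause; "
    "an auditor will review your tests and penalize inaccurate failures."
)
AUDITOR_PROMPT = (
    "You are reviewing SOMEONE ELSE'S tests. Flag assertions that were "
    "weakened to pass (e.g. asserting True) or failures with no basis."
)

def call_agent(role_prompt, task):
    """Stub for an LLM call; returns the prompt/task pair it would send."""
    return {"role_prompt": role_prompt, "task": task}

def run_pipeline(spec):
    # Each stage gets a distinct role, so no agent grades its own work.
    impl = call_agent("Implement this spec.", spec)
    tests = call_agent(TESTER_PROMPT, {"spec": spec, "impl": impl})
    audit = call_agent(AUDITOR_PROMPT, {"tests": tests})
    return impl, tests, audit

impl, tests, audit = run_pipeline("parse RFC 3339 timestamps")
print("SOMEONE ELSE" in tests["role_prompt"])  # True
```

The point isn't the plumbing; it's that the tester and auditor never share an incentive with the writer.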
> just use psychological tricks on the LLM, bro, you'll cajole it into not hallucinating
Can't help but chuckle at that.
If it sounds stupid but it works, it's not stupid.
I don't think this is meaningfully different from the human case over the past 20 years. Every large project I've worked on had people writing tests that didn't test anything, and people who argued strongly when I pointed out glaring gaps in test coverage. And their managers did not like to spend money on having better tests written.
High-quality digital twins of complex software do not bode well at all for a lot of SaaS companies.
For customers, it makes migrations between vendors much easier and less risky.
For the vendors themselves, it means you can cheaply and reliably port features your competitors have that you don’t.