Comment by jbmilgrom
8 hours ago
that's right, and agents turning specs into software can go in all sorts of directions, especially when we don't control the input.
what we've done to mitigate this is essentially back every entrypoint (customer comment, internal ticket, etc.) with a remote Claude Code session with persistent memory - that session becomes the expert on the case. And we've developed checkpoints that work from experience (e.g. the root-cause one) where a human has the opportunity to take the wheel, so to speak, and drive in a different direction with all the context/history up to that point.
basically, we are creating an assembly line where agents do most of the work and humans do less and less as we continue to optimize the different parts of the line
as far as techniques go, it's all boring engineering:
* a Temporal workflow for managing the lifecycle of a session (a rough sketch follows this list)
* complete ownership of the data model e2e. we don't use Linear, for example; we built our own ticketing system so we could represent Temporal signals, GitHub webhooks and events from the remote Claude sessions exactly how we wanted (sketched as types below)
* incremental automation gains, over and over again. We do a lot of the work manually first (like old-fashioned hand coding lol) before trying to automate, so we become experts in that piece of the assembly line and it becomes obvious how to incrementally automate... rinse and repeat
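to make the workflow bullet concrete, here's roughly the shape of a session lifecycle with human-takeover checkpoints, using the Temporal TypeScript SDK (the activity names, phases, and timeouts here are illustrative, not our actual code):

```typescript
import { defineSignal, setHandler, condition, proxyActivities } from '@temporalio/workflow';

// Hypothetical activities that drive the remote Claude Code session.
const { runClaudeSession, applyHumanDirection } = proxyActivities<{
  runClaudeSession(caseId: string, phase: string): Promise<void>;
  applyHumanDirection(caseId: string, direction: string): Promise<void>;
}>({ startToCloseTimeout: '30 minutes' });

// Signal a human sends to take the wheel at a checkpoint.
export const humanTakeover = defineSignal<[string]>('humanTakeover');

export async function caseSessionWorkflow(caseId: string): Promise<void> {
  let direction: string | undefined;
  setHandler(humanTakeover, (d) => { direction = d; });

  // Each phase ends at a checkpoint (e.g. root cause) where a human can
  // intervene with all the context/history accumulated up to that point.
  for (const phase of ['triage', 'root-cause', 'fix', 'verify']) {
    await runClaudeSession(caseId, phase);

    // Wait briefly for a takeover signal; proceed if nobody steps in.
    const intervened = await condition(() => direction !== undefined, '15 minutes');
    if (intervened && direction !== undefined) {
      await applyHumanDirection(caseId, direction);
      direction = undefined;
    }
  }
}
```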
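and the data-model bullet as types, i.e. one append-only event timeline per ticket (field names illustrative, not our actual schema):

```typescript
// Hypothetical unified event model for a home-grown ticketing system.
type TicketEvent =
  | { kind: 'temporal-signal'; workflowId: string; signalName: string; payload: unknown }
  | { kind: 'github-webhook'; event: 'pull_request' | 'push' | 'check_run'; payload: unknown }
  | { kind: 'claude-session'; sessionId: string; type: 'message' | 'checkpoint' | 'done'; summary: string };

interface Ticket {
  id: string;
  entrypoint: 'customer-comment' | 'internal-ticket';
  sessionId: string;      // the remote Claude session that owns the case
  events: TicketEvent[];  // every signal, webhook and session event in one timeline
}
```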
Ooh, it sounds like you've already got most of the groundwork done for something I was wondering about yesterday: I'd love it if, during an incident, some system could pull all the PRs included in the latest release, check which agents worked on them (i.e. via a line in the commit message with an identifier that corresponds to the agent's LLM context and any other data at the time of commit), "rehydrate" those agents from the corresponding stored context, feed them the relevant incident data, and ask whether the incident could be related to their changes and what to do about it.
In most cases it might not be much more valuable than just looking through the diffs from scratch with a new agent, but there will probably be some cases where a rehydrated agent goes "D'oh, I meant to do X but it looks like I hallucinated Y instead. Here's a PR to fix it!"
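Roughly what I'm imagining (every function and field here is hypothetical, since none of this exists yet):

```typescript
// Sketch of the "rehydrate and ask" incident flow.
interface CommitAgentRef { sha: string; agentContextId: string }

// Assumed helpers: one parses an `Agent-Context:` trailer (or similar) out of
// each commit in the release, the other restores an agent from stored context.
declare function listCommitsWithAgentTrailers(releaseTag: string): Promise<CommitAgentRef[]>;
declare function rehydrateAgent(contextId: string): Promise<{
  ask(prompt: string): Promise<{ possiblyRelated: boolean; explanation: string }>;
}>;

async function triageIncident(releaseTag: string, incidentReport: string): Promise<void> {
  // 1. Find the commits in the latest release and the agent context IDs
  //    recorded in their commit messages.
  const refs = await listCommitsWithAgentTrailers(releaseTag);

  for (const ref of refs) {
    // 2. Rehydrate the agent from its stored context at commit time.
    const agent = await rehydrateAgent(ref.agentContextId);

    // 3. Feed it the incident data and ask if its change could be related.
    const answer = await agent.ask(
      `Given this incident report, could your change (${ref.sha}) be related? ` +
        `If so, what should we do about it?\n\n${incidentReport}`,
    );
    if (answer.possiblyRelated) console.log(ref.sha, answer.explanation);
  }
}
```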
I know that's just a small piece of what you're doing, but I think it's something that would be valuable on its own, and likely to soon be "standard infrastructure" for any company that does even a little agentic coding (assuming it works). It'd probably even be "required infrastructure" in regulated industries; the fact that all these agent contexts are ephemeral has to be a red flag from a regulatory perspective.