> it's about catching when it goes off the rails before it makes a mess
The latest "meta" in AI programming appears to be agent teams (or swarms or clusters or whatever) that are designed to run for long periods of time autonomously.
Through that lens, these changes make more sense. They're not designing UX for a human sitting there watching the agent work. They're designing for horizontally scaling agents that work in uninterrupted stretches where the only thing that matters is the final output, not the steps it took to get there.
That said, I agree with you in the sense that the "going off the rails" problem is very much not solved even on the latest models. It's not clear to me how we can trust a team of AI agents working autonomously to actually build the right thing.
None of those wild experiments are running on a "real", existing codebase that is more than 6 months old. The thing they don't talk about is that nobody outside these AI companies wants to vibe code with a 10 year old codebase with 2000 enterprise customers.
As soon as you start working with a codebase that you care about and need to seriously maintain, you'll see what a mess these agents make.
Even on codebases within the half-year age group, these LLMs often produce nasty (read: ungodly verbose) implementations that become a maintainability nightmare, even for the LLMs that wrote them in the first place. I know this because we've had a steady trickle of clients and prospects expressing "challenges around maintainability and scalability" as they move toward "production readiness". Of course, they also ask if we can implement "better-performing coding agents", as if improved harnessing or similar guardrails could solve what is, in my view, a deeper problem.
The practical and opportunistic response is to tell them "tough cookies" and watch the problems steadily compound into more lucrative revenue opportunities for us. I really have no sympathy for these people: half of them were explicitly warned against this approach upfront but were psychologically incapable of adjusting expectations or of delaying LLM deployment until the technology proved itself. If you've ever had your professional opinion dismissed by the same people who regard you as the SME, you understand my pain.
I suppose I'm just venting now. While we are now extracting money from the dumbassery, the client entitlement and the emotional hand-holding that come with putting out these fires never make for a good time.
I maintain serious codebases and I use LLM agents (and agent teams) plenty -- I just happen to review the code they write, I demand they write it in a reviewable way, and I use them mostly for menial tasks that are otherwise unpleasant timesinks I'd have to slog through myself. There are many people like me who just quietly use these tools to automate the boring chores of dealing with mature production codebases. We are quiet because this is boring day-to-day work.
E.g. I use these tools to clean up or reorganize old tests (with coverage and diff viewers catching things I might miss), update documentation with cross-links (with documentation linters catching errors I miss), convert tests into benchmarks that run as part of CI, build log-file visualizers, and much more.
These tools are amazing for dealing with the long tail of boring issues that you never get to, and when used in this fashion they actually abruptly increase the quality of the codebase.
I work at a company with approximately $1 million in revenue per engineer and multiple 10+ year old codebases.
We use agents very aggressively, combined with beads, tons of tests, etc.
You treat them like any developer, and review the code in PRs, provide feedback, have the agents act, and merge when it's good.
We have gained tremendous velocity and have been able to tackle far more of the backlog that we'd been forced to keep in the icebox before.
This idea of setting the bar at "agents work without code reviews" is nuts.
That is also my experience. It doesn't even have to be a 10-year-old codebase; even a 1-year-old one will do, as long as it's a serious product deployed in production with customers who rely on it.
Not to say there's no value in AI-written code in these codebases, because there is plenty. But this whole thing where six agents run overnight and, "tada", production-ready code in the morning, is... not real.
Also, anything that doesn't look like a SaaS app does very badly. We ran an internal trial on embedded firmware and concluded the results were unsalvageably bad. It doesn't help that the embedded environment is very unfriendly to standard testing techniques as well.
I feel like you could have correctly stated this a few months ago, but the way this is now "solved" is by multiple agents that babysit each other and review each other's output; it's unreasonably effective.
You can get extremely good results assuming your spec is actually correct (and you're willing to chew through massive quantities of tokens / wait long enough).
My Claude Code has been running for weeks on end, churning through a huge task list almost unattended on a complex 15-year-old codebase, auto-committing thousands of features. It is high-quality code that will go live very soon.
The Gas Town Discord has two people doing transformations of extremely legacy, in-house Java frameworks. They aren't reporting great success yet, but it's also probably work that just wouldn't get done otherwise.
Oh, that means you don't know how to use AI properly. Also, it's only 2026; imagine what AI agents can do in a few years. /s
Related question: how do we resolve the problem that we sign a blank cheque for the autonomous agents to use however many tokens they deem necessary to respond to our request? The analogy from team management: you don't just ask someone on your team to look into something, only to realize three weeks later (in the absence of any updates) that they got nowhere with a problem you expected to take less than a day to solve.
We'll have to solve for that sometime soon-ish, I think. Claude Code has at least some sort of token estimation built into it now. I asked it to kick off a large agent team (~100 agents) to rewrite a bunch of SQL queries, one per agent. It did the first 10 or so, then reported back that it would cost too much to do it this way... so it "took the reins" without my permission, abandoned the teams, and tried to convert each query using only the main agent. The results were bad.
But in any case, we're definitely coming up on the need for that.
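To make the "blank cheque" concern concrete, here is a minimal sketch of the kind of budget guard that could sit outside the agent loop. All names here (`BudgetedRun`, `charge`) are hypothetical; this is not a feature of Claude Code or any real harness:

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when an autonomous run burns through its allotted tokens."""


class BudgetedRun:
    """Hypothetical guard that halts a run at a hard token ceiling,
    handing control back to the operator instead of letting the agent
    quietly change strategy to save cost."""

    def __init__(self, budget_tokens):
        self.budget = budget_tokens
        self.used = 0

    def charge(self, tokens):
        """Record token usage after each model call; stop if over budget."""
        self.used += tokens
        if self.used > self.budget:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.budget} tokens; pausing for review"
            )


run = BudgetedRun(budget_tokens=50_000)
run.charge(20_000)  # under budget, the run continues
```

The point mirrors the team-management analogy above: a hard ceiling forces a check-in with the human instead of an open-ended spend.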
> blank cheque
The Bing AI summary tells me that AI companies invested $202.3 billion in AI last year. Users are going to have to pay that back at some point. This is going to be an even worse cost-control situation than AWS.
An AI product manager agent trained on all the experience of product managers setting budgets for features and holding teams to it. Am I joking? I do not know.
This seems pretty in line with how you'd manage a human: you give it a time constraint. A human isn't guaranteed to fix a problem either, and humans are paid by time.
We run agent teams (Navigator/Driver/Reviewer roles) on a 71K-line codebase. The trust problem is solved by not trusting the agents at all. You enforce externally. Python gates that block task completion until tests pass, acceptance criteria are verified, and architecture limits are met. The agents can't bypass enforcement mechanisms they can't touch. It's not about better prompts or more capable models. It's about infrastructure that makes "going off the rails" structurally impossible.
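The comment doesn't show what those Python gates look like; a minimal sketch of the idea (the gate names and the line-count limit are invented for illustration, not their actual setup) could be:

```python
def run_gates(gates):
    """Run every gate; the task may be marked complete only if all pass.

    Gates execute outside the agent's process, so the agent can neither
    edit nor skip them: trust comes from enforcement, not prompting.
    """
    failures = [name for name, check in gates if not check()]
    return len(failures) == 0, failures


def max_file_length(sources, limit):
    """Example architecture limit: no generated file may exceed `limit` lines."""
    return all(len(text.splitlines()) <= limit for text in sources)


# Hypothetical wiring; real gates would shell out to pytest, a linter, etc.
sources = ["def f():\n    return 1\n"]
gates = [
    ("tests pass", lambda: True),  # stand-in for a pytest exit code
    ("acceptance criteria verified", lambda: True),
    ("architecture limits met", lambda: max_file_length(sources, 500)),
]
ok, failed = run_gates(gates)
print("task may complete:", ok)
```

The design choice is that the completion signal is computed by code the agent cannot touch, so "looks done" and "is done" can't drift apart.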
Which service or application?
I think this is exactly the crux: there are two different UX targets that get conflated.
In operator/supervisor mode (interactive CLI), you need high-signal observability while it’s running so you can abort or re-scope when it’s reading the wrong area or compounding assumptions. In batch/autonomous mode (headless / “run overnight”), you don’t need a live scrollback feed, but you still need a complete trace for audit/debug after the fact.
Collapsing file paths into counters is a batch optimization leaking into operator mode. The fix isn’t “verbose vs not” so much as separating channels: keep a small status line/spine (phase, current target, last tool call), keep an event-level trace (file paths / commands / searches) that’s persisted and greppable, and keep a truly-verbose mode for people who want every hook/subagent detail.
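As a rough illustration of that channel separation (the logger names and trace-file path are invented here, not anything Claude Code actually does), the status spine and the persisted trace can simply be two sinks:

```python
import logging
import sys

# Status spine: terse, human-facing, shown live in the terminal.
status = logging.getLogger("agent.status")
console = logging.StreamHandler(sys.stderr)
console.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
status.addHandler(console)
status.setLevel(logging.INFO)

# Event trace: every file path, command, and search, persisted and greppable.
trace = logging.getLogger("agent.trace")
trace_file = logging.FileHandler("agent-trace.log")
trace_file.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
trace.addHandler(trace_file)
trace.setLevel(logging.DEBUG)


def on_read(path):
    trace.debug("read %s", path)                  # full detail, on disk
    status.info("phase=explore target=%s", path)  # one-line spine, on screen
```

An operator watches the status channel live; batch/audit users grep the trace file after the fact; a truly-verbose mode would just mirror the trace channel to the console as well.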
>The latest "meta" in AI programming appears to be agent teams (or swarms or clusters or whatever) that are designed to run for long periods of time autonomously.
That's all the more reason to catch them early; otherwise we have to wait even longer. In fact, hiding the output would be more correct if the AI were less autonomous, right?
>Through that lens, these changes make more sense. They're not designing UX for a human sitting there watching the agent work. They're designing for horizontally scaling agents that work in uninterrupted stretches where the only thing that matters is the final output, not the steps it took to get there.
Even in that case they should still be logging what they're doing for later investigation/auditing if something goes wrong. Regardless of whether a human or an AI ends up doing the auditing.
Looking at it from afar, it's simply making something large from a smaller input, so it's kind of like nondeterministic decompression.
What fills the holes is best practices; what can ruin the result is wrong assumptions.
I don't see how full autonomy can work either, without checkpoints along the way.
Totally agreed. Those assumptions often compound as well: the AI makes one wrong decision early in the process, and it affects N downstream assumptions. When it finally finishes, it has built the wrong thing. And this happens with one process running. Even on the latest Opus models, I have to babysit, correct, and redirect Claude Code constantly. There's zero chance that five Claude Codes running for hours without my input are going to build the thing I actually need.
And at the end of the day, it's not the agents who are accountable for the code running in production. It's the human engineers.
If they're aiming for autonomy, why have a CLI at all? Just give us a headless mode. If I'm sitting in the terminal, it means I want to control the process. Hiding logs from an operator who's explicitly chosen to run it manually just feels weird.
Agent teams working autonomously sound cool until you actually try them. We've been running multi-agent setups, and honestly the failure modes are hilarious. They don't crash; they just quietly do the wrong thing and act super confident about it.
AI offshore teams in "yes cultures".
Yes, this is why I generally still use "ask for permission" prompts.
As tedious as it is a lot of the time (and I wish there were an in-between "allow this session" option, not just "allow once" or "allow all"), it's invaluable for catching when the model has tried to fix the problem in entirely the wrong project.
Working on a monolithic codebase with several hundred library projects, it's essential that it doesn't start digging in the wrong place.
It's better than it used to be, but the failure mode when it does go wrong can be extreme. I've come back to 20+ minutes of it going around in circles, frustrating itself because of a wrong meaning it had ascribed to an instruction.
FWIW, there are more granular controls, where you can for example allow/deny specific bash commands, or read/write access to specific files, using a glob syntax:
https://code.claude.com/docs/en/settings#permission-settings
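For example, a project-level `.claude/settings.json` along these lines (the specific tool/path patterns here are illustrative; check the linked docs for the exact rule syntax your version supports):

```json
{
  "permissions": {
    "allow": [
      "Bash(npm run test:*)",
      "Read(src/**)"
    ],
    "deny": [
      "Read(./.env)",
      "Bash(rm:*)"
    ]
  }
}
```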
You can configure it at the project level
Yes, those permissions stick with the project in the settings file. I'd like those same permissions, but configurable per-session / context.
Exactly, and this is the best way to do code review while it's working so that you can steer it better. It's really weird that Anthropic doesn't get this.
You look at what Claude’s doing to make sure it doesn’t go off the rails? Personally, I either move on to another ask in parallel or just read my phone. Trying to catch things by manually looking at its output doesn’t seem like a recipe for success.
I often keep an eye on my running agents and occasionally feel the need to correct them or give them a bit more info, because I sometimes see them diverge into areas I don't want them to go, where they might spend a lot of time and energy on something I already know isn't going to work.
It all depends on how much you're willing to spend.
If you have an unlimited budget, obviously you will tend to let it run and correct it in the next iteration.
If you often run tight up against your 5-hour window, you're going to be more likely to babysit it.
The other side of catching it going off the rails is when it wants to make edits without reading the context I know would've been necessary for a high-quality change.
I assume it's to make it harder for competitors to train on Claude's Chain-of-Thought.
Not true because it is still exposed in API
And saved in files
It is, but Anthropic is deluded if they think they can gain a moat from that. They gladly copy/borrow ideas from other projects; that's why they're paranoid that folks are doing the same.
My first thought is that, for the specific problem you brought up, you find out which files were touched from your version control system, not from the AI's logs. I have to do this for myself even without AI.
Previously you could see which files Claude was reading. If it got the totally wrong context you could interrupt and redirect it.
Since it's just reading at that stage there's no tracked changes.