
Comment by nightshift1

3 days ago

I think that letting an LLM run unsupervised on a task is a good way to waste time and tokens. You need to catch it before it strays too far off-path. I stopped using subagents in Claude because I wasn't able to see what they were doing and intervene. Indirectly asking an LLM to prompt another LLM to work on a long, multi-step task doesn't seem like a good idea to me. I think community efforts should go toward making LLMs more deterministic with the help of good old-fashioned software tooling instead of role-playing and writing prayers to the LLM god.

When a task is bigger than I trust the agent to handle on its own, or than I can reasonably review in one pass, I ask it to create a plan with steps, then a .md file for each step. I review the steps and ask the agent to implement the first one. I review that one, fix it, ask it to update the remaining steps, and then have it implement the next one. And so on, until finished.

  • Have you tried scoped context packages? Basically, for each task I create a .md file that includes relevant file paths, the purpose of the task, key dependencies, a clear plan of action, and a test strategy. It’s like a mini local design doc. I’ve found it helps ground the implementation and stabilizes the agents’ output. (A sketch of one possible shape follows at the end of this subthread.)

    • I read this suggestion a lot: “Make clear steps, a clear plan of action.” Which I get. But instead of having an LLM flail away at it, couldn’t we give it to an actual developer? It seems like we’ve finally realized that clear specs make dev work much easier for LLMs. But the same is true for a human. The human will ask more clarifying questions and not hallucinate. The LLM will roll the dice and pick a path. Maybe we as devs would just rather talk with machines.


  • Separately, you have to consider that "wasting tokens spinning" might be acceptable if you're able to run hundreds of thousands of these things in parallel. If even a small subset of them translates to value, you come out far ahead of a strictly manual/human process.

  • I do the same thing with my engineers but I keep the tasks in Jira and I label them "stories".

    But in all seriousness +1 can recommend this method.
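For illustration only, here is a rough sketch of what one of those per-task .md files could look like. The file name, paths, and section headings are invented for the example, not a prescribed format:

```markdown
<!-- task-rate-limiter.md — hypothetical scoped context package -->
# Task: add a rate limiter to the API client

## Purpose
Keep the sync job under the upstream quota of 100 requests/minute.

## Relevant files
- src/api/client.ts
- src/jobs/sync.ts

## Key dependencies
- The retry wrapper in src/api/retry.ts (must not double-count retried requests)

## Plan of action
1. Wrap outgoing requests in a simple token-bucket limiter.
2. Make the limit configurable via an environment variable.
3. Update the sync job to reuse the shared client instance.

## Test strategy
- Unit test: the 101st request within a minute is delayed, not dropped.
- Existing sync integration tests must still pass unchanged.
```

The exact layout doesn't matter; the point is that each piece of context mentioned above (paths, purpose, dependencies, plan, tests) has a concrete place to live before the agent starts.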

There are two opposite ways to do this.

Codex is like an external consultant. You give it specs and it quietly putters away and only stops when the feature is done.

Claude is built more like a pair programmer: it displays changes live and "talks" about what it's doing, what's working, etc.

It's really, REALLY hard to abort Codex mid-run to correct it. With Claude it's a lot easier when you see it doing something stupid or going off the rails. Just hit ESC and tell it where it went wrong (like "use task build, don't build it manually" or "use markdownlint, don't spend 5 minutes editing the markdown line by line").

Yeah in my experience, LLMs are great but they still need babysitting lest they add 20k lines of code that could have been 2k.

I also use AI to do discrete, well-defined tasks so I can keep an eye on things before they go astray.

But I thought there were lots of agentic systems that loop back and ask for approval every few steps, or after each agent does its piece. Is that not the case?