Imo there's a huge blind spot forming between stages 6 and 8: talking to people and reading posts by various agent evangelists, few seem to be focussing on building "high quality" changes vs maximising throughput of low-quality work items.
My (boring b2b/b2e) org has scripts that wrap a small handful of agent calls to handle/automate our workflow. These have been incredibly valuable.
We still 'yolo' into PRs, use agents to improve code quality, and do initial checks via gating. We're trying to get docs working through the same approach. We see huge value in automating and lightweight orchestration of agents, but other parts of the whole system are the bottleneck, so there's no real point in running more than a couple of agents concurrently - Claude could already build a low-quality version of our entire backlog in a week.
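For a sense of scale, the wrappers are tiny; something like the sketch below (the prompts, gate commands, and Makefile targets are illustrative assumptions rather than our actual scripts - the only real interface assumed is the claude CLI's non-interactive `--print` mode):

```python
"""Minimal sketch of a workflow wrapper: run cheap deterministic gates,
then a couple of agent calls. Prompts, gate commands, and Makefile targets are
illustrative assumptions; the only real interface assumed is `claude --print`."""
import subprocess
import sys

def run(cmd: list[str]) -> subprocess.CompletedProcess:
    print("+", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True)

def gate() -> bool:
    # Spend no tokens until lint and tests pass (targets are placeholders).
    return run(["make", "lint"]).returncode == 0 and run(["make", "test"]).returncode == 0

def agent(prompt: str) -> str:
    # Non-interactive Claude Code call; --print writes the result to stdout.
    result = run(["claude", "--print", prompt])
    if result.returncode != 0:
        sys.exit(f"agent call failed: {result.stderr}")
    return result.stdout

if __name__ == "__main__":
    ticket = " ".join(sys.argv[1:])  # short ticket description
    if not gate():
        sys.exit("gate failed; fix lint/tests before involving an agent")
    agent(f"Implement this change and run the tests: {ticket}")
    agent("Review the resulting diff for code quality issues and fix them.")
```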
Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?
I have a code quality analysis tool that I use to "un-slopify" AI code. It doesn't handle algorithms and code semantics, which are still the programmer's domain, but it does a pretty good job of forcing agents to DRY out code, separate concerns, group code more intelligently, and generally write decoupled quasi-functional code. It works quite well with the Ralph loop to deeply restructure codebases.
https://github.com/sibyllinesoft/valknut
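In practice the loop is roughly this shape (a sketch only: `quality-analyzer` and its JSON output are placeholders standing in for the tool, not valknut's actual CLI or report format):

```python
"""Sketch of an 'un-slopify' loop: run a quality analyzer, feed findings back to
an agent, repeat until clean. The analyzer invocation and JSON shape are
hypothetical placeholders, not valknut's real interface."""
import json
import subprocess

MAX_PASSES = 5

def analyze() -> list[dict]:
    # Placeholder: assume the analyzer can emit machine-readable findings.
    out = subprocess.run(["quality-analyzer", "--format", "json", "src/"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def fix(findings: list[dict]) -> None:
    prompt = ("Refactor to address these findings without changing behaviour. "
              "Prefer removing duplication and separating concerns:\n"
              + json.dumps(findings, indent=2))
    subprocess.run(["claude", "--print", prompt], check=True)

for _ in range(MAX_PASSES):
    findings = analyze()
    if not findings:
        break
    fix(findings)
```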
> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?
That’s what I’ve been focused on for the last few weeks with my own agent orchestrator. The actual orchestration bit was the easy part; the key is making it self-improving via “workflow reviewer” agents that can create new reviewers specializing in catching a specific set of antipatterns, like swallowing errors. Unfortunately I've found that what counts as acceptable code quality is very dependent on project, organization, and even module (tests vs internal utilities vs production services), so prompt instructions like "don't swallow errors or use unwrap" make one part of the code better while another gets worse, creating a conflict for the LLM.
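One way to soften that conflict is scoping each reviewer's instructions to paths instead of stating them globally; a minimal sketch of the idea (the Reviewer class and rule strings are illustrative, not a real framework):

```python
"""Sketch: scope reviewer instructions by path so 'don't swallow errors' applies
to production code but not to tests or throwaway utilities. The Reviewer class
and the rule strings are illustrative, not a real framework."""
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class Reviewer:
    name: str
    paths: list[str]          # glob patterns this reviewer cares about
    instructions: list[str]   # antipatterns it should flag

REVIEWERS = [
    Reviewer("error-handling", ["src/services/**"],
             ["Never swallow errors or call unwrap(); propagate with context."]),
    Reviewer("test-style", ["tests/**"],
             ["unwrap()/panics are fine here; flag missing edge-case coverage instead."]),
]

def reviewers_for(changed_file: str) -> list[Reviewer]:
    return [r for r in REVIEWERS if any(fnmatch(changed_file, p) for p in r.paths)]

def build_review_prompt(changed_files: list[str]) -> str:
    lines = ["Review this diff. Apply only the rules listed for each file:"]
    for f in changed_files:
        for r in reviewers_for(f):
            lines.append(f"{f} ({r.name}): " + " ".join(r.instructions))
    return "\n".join(lines)
```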
The problem is that model eval was already the hardest part of using LLMs, and evaluating agents is even harder, if not practically impossible. The toy benchmarks the AI companies have been using are laughably inadequate.
So far the best I’ve got is “reimplement MINPACK from scratch using their test suite”, which can take days and has to be manually evaluated.
> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?
Yes, I am, although not really in public yet. I use the pi harness, which is really easy to extend. I’m basically driving a deterministic state machine for each code ticket. It starts by refining a short ticket into a full problem description by interviewing me one question at a time, then converts that into a detailed plan with individual steps.

Then it implements each step one by one using TDD, and each bit gets reviewed by an agent in a fresh context. First the tests are written and reviewed to ensure they completely cover the initial problem, and any problems are addressed; that goes round a loop till the review agent is happy, then it moves to implementation. Same thing there: the implementation is written and looped until the tests pass, then reviewed and fixed until the reviewer is happy. Each sub-task gets its own commit. When all the tasks are done, there’s an overall review that I look at. If everyone is happy, the commits get squashed and we move to manual testing. The agent comes up with a full list of manual tests to cover the change, sets up the test scenarios, and tells me where to debug in the code while working through each test case, so I understand what’s been implemented.

So this is semi-automated - I’m heavily involved at the initial refine stage, then I check the plan. The various implementation and review loops are mostly hands off, then I check the final review and do the manual testing obviously.
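The driver itself is nothing exotic; conceptually it's something like the sketch below (heavily simplified: the real version sits on the pi harness and iterates over every plan step, whereas here `run_agent` just shells out to a CLI agent in print mode as a stand-in and only one step is shown):

```python
"""Heavily simplified sketch of the per-ticket driver. run_agent() shells out to
a CLI agent non-interactively as a stand-in for the harness; one process per
call means one fresh context per step."""
import subprocess
from enum import Enum, auto

class State(Enum):
    REFINE = auto()
    PLAN = auto()
    WRITE_TESTS = auto()
    IMPLEMENT = auto()
    FINAL_REVIEW = auto()
    DONE = auto()

def run_agent(role: str, task: str) -> str:
    out = subprocess.run(["claude", "--print", f"[{role}] {task}"],
                         capture_output=True, text=True, check=True)
    return out.stdout

def approved(reviewer: str, artifact: str) -> bool:
    verdict = run_agent(reviewer, f"Review this. Reply APPROVED if acceptable:\n{artifact}")
    return "APPROVED" in verdict

def drive(ticket: str) -> None:
    state, artifact = State.REFINE, ticket
    while state is not State.DONE:
        if state is State.REFINE:
            artifact = run_agent("interviewer", f"Refine into a full problem description: {artifact}")
            state = State.PLAN
        elif state is State.PLAN:
            artifact = run_agent("planner", f"Produce a step-by-step plan for: {artifact}")
            state = State.WRITE_TESTS
        elif state is State.WRITE_TESTS:
            tests = run_agent("test-writer", f"Write failing tests for the first step of: {artifact}")
            while not approved("test-reviewer", tests):
                tests = run_agent("test-writer", f"Address the review feedback on:\n{tests}")
            state = State.IMPLEMENT
        elif state is State.IMPLEMENT:
            impl = run_agent("implementer", "Make the new tests pass (TDD), then commit.")
            while not approved("code-reviewer", impl):
                impl = run_agent("implementer", f"Address the review feedback on:\n{impl}")
            state = State.FINAL_REVIEW
        elif state is State.FINAL_REVIEW:
            run_agent("final-reviewer", "Review the overall change before squashing.")
            state = State.DONE
```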
This is definitely much slower than something like Gas Town, but all the components are individually simple, the driver is a deterministic program, not an agent, and I end up carefully reviewing everything. The final code quality is very good. I generally have 2-4 changes like this ongoing at any one time in tmux sessions, and I just switch between them. At some point I might make a single dashboard with summaries of where the process is up to on each, and whether it needs my input, but right now I like the semi manual process.
Pied-Piper is another subagent orchestration system that works from a single Claude Code session, with an orchestrator and multiple agents that hand off tasks to each other to follow a workflow - https://github.com/sathish316/pied-piper
It has Playbooks for repeatable workflows, with which you can model both generic SDLC workflows (Plan-Code-Review-Security Review-Merge) and complex workflows like language migration or tech stack migration (Problem Breakdown-Plan-Migrate-IntegrationTest-TechStackExpertReview-CodeReview-Merge).
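For anyone curious, a playbook in this style boils down to an ordered list of agent handoffs; something like the shape below (illustrative only, not Pied-Piper's actual playbook format):

```python
"""Illustrative only: a playbook modelled as an ordered list of agent handoffs.
This is the general shape of such a workflow, not Pied-Piper's actual format."""
SDLC_PLAYBOOK = [
    {"agent": "planner",           "task": "Break the ticket into steps"},
    {"agent": "coder",             "task": "Implement the planned steps"},
    {"agent": "code_reviewer",     "task": "Review the diff for correctness and style"},
    {"agent": "security_reviewer", "task": "Review for injection, authz, and secrets handling"},
    {"agent": "merger",            "task": "Squash, update the changelog, open the merge"},
]

def run_playbook(playbook, dispatch):
    """dispatch(agent, task, context) -> new context; each step hands off to the next."""
    context = {}
    for step in playbook:
        context = dispatch(step["agent"], step["task"], context)
    return context
```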
Hopefully it will need only minimal changes once Claude Swarm and Claude Tasks become mainstream.
What kind of basic-ass CRUD apps are people even working on that they're at stage 5 and up? Certainly not anything with performance, visual, embedded, or GPU requirements.
I think you massively underestimate the number of useful apps that are CRUD plus a bit of business logic and styling. They’re useful, can genuinely take time to build, can be unique every time, and yet are not brand-new research projects.
A lot of stuff is simultaneously useful but not mission critical, which is where I think the sweet spot of LLMs currently lies.
In terms of the state of software quality, the bar has actually been _lowered_, in that even major user-facing bugs in operating systems are no longer a showstopper. So it's no surprise to me that people are vibe-coding things "in prod" that they actually sell to other people (some even theorize Claude Code itself is vibe-coded, hence its bugs; and yet that hasn't slowed down adoption, because of the Claude Max lock-in).
So maybe one alternate way to see the "productivity gains" from vibe-coding in deployed software is that it's actually a realization that quality doesn't matter. The seeds for this were already laid years back when QA vanished as a field.
LLMs occupy a new realm on the Pareto frontier: the "slipshod expert". Usually humans grow from "sloppy incompetent newb" to "prudent experienced dev". But now we have a strange situation where LLMs can write code (e.g. vectorized loops, CUDA kernels) that could normally only be written by those with sufficient domain knowledge, and yet (ironically) it's not done with the attention and fastidiousness you'd expect from such an experienced dev.
No totally, I agree. But I don't think that anyone will be YOLO vibe coding massive changes into Blender or ffmpeg any time soon.
What would be an example of something you think wouldn’t work with 5 or higher? Is there something about GPU programming that LLMs can’t handle?
I doubt they'd do a very good job of debugging a GPU crash, or visual noise caused by forgotten synchronization, or odd-looking shadows.
Maybe for some things you could set it up so that the screen output is livestreamed back into the agent, but I highly doubt that anyone is doing that for agents like this yet.
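The plumbing for that wouldn't be very exotic, though; a rough sketch of the capture side, using the real mss screenshot library and a hypothetical send_frame_to_agent hook:

```python
"""Rough sketch of livestreaming screen output back to an agent.
Uses the real mss screenshot library; send_frame_to_agent() is hypothetical -
in practice you would attach the PNG to whatever multimodal agent API you drive."""
import base64
import time
import mss
import mss.tools

def send_frame_to_agent(png_b64: str) -> None:
    raise NotImplementedError("attach the image to your agent's next turn here")

def stream(interval_s: float = 2.0) -> None:
    with mss.mss() as sct:
        monitor = sct.monitors[1]          # primary display
        while True:
            shot = sct.grab(monitor)
            png = mss.tools.to_png(shot.rgb, shot.size)
            send_frame_to_agent(base64.b64encode(png).decode())
            time.sleep(interval_s)
```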
Given the infancy of all of these tools, it makes sense to experiment. So trying out everything that is not Gas Town is reasonable.
I haven't yet tried Gas Town (or any of the mentioned tools), as I don't need enough agents to justify something like that, plus there are the cost concerns. I've been rolling my own very light orchestrator (mostly just worktrees/branches/instructions) and relying on Claude itself to manage the sub-agents as necessary.
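"Very light" really does mean a few dozen lines; roughly the sketch below (branch naming, the .worktrees/ layout, and the INSTRUCTIONS.md file are just my own conventions, and the claude --print call stands in for however you launch a session):

```python
"""Sketch of a minimal worktree-per-task setup: one branch, one worktree, one
instructions file, one agent session. The layout and naming are arbitrary
conventions; git worktree and claude --print are the only standard pieces."""
import subprocess
from pathlib import Path

def start_task(slug: str, instructions: str, repo: Path = Path(".")) -> Path:
    worktree = repo / ".worktrees" / slug
    subprocess.run(["git", "worktree", "add", "-b", f"agent/{slug}", str(worktree)],
                   cwd=repo, check=True)
    (worktree / "INSTRUCTIONS.md").write_text(instructions)
    # Non-interactive session confined to the isolated checkout.
    subprocess.run(["claude", "--print", "Read INSTRUCTIONS.md and do the work described there."],
                   cwd=worktree, check=True)
    return worktree

def finish_task(slug: str, repo: Path = Path(".")) -> None:
    # The branch survives for review/merge; only the working copy is removed.
    subprocess.run(["git", "worktree", "remove", str(repo / ".worktrees" / slug)],
                   cwd=repo, check=True)
```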
I was a bit surprised by the "ripping out beads" sentence in the article, as beads does seem to serve a purpose independent of the orchestration tools. Giving agents a ticketing system independent of what we humans use makes a lot of sense to me.
I've experimented with using Jira/Linear to handle the "current work" todos, and using beads just seems so much better. No MCPs and no remote API calls is pretty great.
I'll be curious to see how the other orchestration tools are handling this, because it seems like they will have to handle it.
Replaced beads with a skill reminding Claude it could use the gh CLI to manage GitHub issues, and never really looked back. I had already noticed on smaller projects that a markdown punch list plus the built-in todo tool was usually more than enough, and between those two I didn’t feel the need for beads anymore.
Do people actually have success with agent orchestrators? I find that it quickly overwhelms my ability to keep track of what it's doing.
When I want to learn code or understand a new architecture, I stick at stage 1. When I want to validate an idea, stage 5 and beyond makes perfect sense to go YOLO. I might have to try one of these orchestrators one day, but only when I'm regularly getting stopped because I've hit my credit limit. For my current usage, I'm pretty happy where I'm at.
Worth looking into Conductor.build and Sculptor as well; I believe both are Electron and run like sh*t, but Conductor is quite good. Going to give this Vibe Kanban a go, thanks.
Orchestration is cool, but a sane orchestration setup with VMs is where it's at.
I've been working on orchestration for the past little while and I've got a very tight loop going where everything is in worktrees and containerized, with all services isolated, so I can easily work on db schema/migration stuff while a few other agents do frontend work, etc. Getting Conductor to play nice with VMs was very tricky, as their docs say they have no intention of implementing VMs and include a "trust me bro, it won't erase your system" blurb about it [0]. (A rough sketch of the per-worktree isolation follows the link.)
[0] https://docs.conductor.build/faq#what-permissions-do-agents-...
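The isolation part is mostly just giving each checkout its own compose project; a sketch (the compose file and its services are assumptions about a typical setup, while the -p project scoping, up --wait, and down -v are standard compose behaviour):

```python
"""Sketch of per-worktree isolation: each checkout gets its own docker compose
project, so agents can run their own db/services (and migrations) without
colliding. The compose file and its services are assumed."""
import subprocess
from pathlib import Path

def services_up(worktree: Path) -> None:
    project = f"agent-{worktree.name}"
    # A distinct project name gives this worktree its own networks, volumes,
    # and container names. Leave host port mappings dynamic to avoid clashes.
    subprocess.run(["docker", "compose", "-p", project, "up", "-d", "--wait"],
                   cwd=worktree, check=True)

def services_down(worktree: Path) -> None:
    subprocess.run(["docker", "compose", "-p", f"agent-{worktree.name}", "down", "-v"],
                   cwd=worktree, check=True)
```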
Co-creator of Conductor here! Conductor is actually a Tauri 2.0 app, so it uses the native Safari renderer. Working on getting remote workspaces working as we speak!
Could you perhaps replace the VMs with bubblewrap instead?