Comment by majormajor
18 days ago
> When you commit to a single agent, you're predicting it will be best for whatever task you throw at it.
A quibble with this: you're not predicting it will be the best for whatever task you throw at it; you're predicting it will be sufficient.
For well-understood problems you can get adequate results out of a lot of models these days. Having to review n different outputs sounds like a step backwards for most tasks.
I do this sort of thing at the planning stage, though. Especially because there's not necessarily an obvious single "right" answer for a lot of questions, like how to break down a domain, or how to coordinate multiple processes. So if three different models suggest three different approaches, it helps me refine what I'm actually looking for in the solution. And that increases the hit rate for my "most models will do something sufficient" claim above.
This is a good point!
We still code via interactive sessions with single agents when the stakes are lower (simple changes, one-off scripts, etc.). But for more important work, we generally want the highest-quality solution possible.
We also use this framework for brainstorming and planning. E.g. sometimes we ask the agents to write design docs, then compare and contrast them. Or we intentionally under-specify a task, see what the agents do, and use that to refine the spec before launching the real run.
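The "same prompt, several agents, compare the outputs" workflow both comments describe can be sketched in a few lines. This is a hypothetical illustration, not anyone's actual tooling: the agent names and callables are stand-ins, and in practice each entry would wrap a real model API client.

```python
from typing import Callable

def fan_out(prompt: str, agents: dict[str, Callable[[str], str]]) -> dict[str, str]:
    # Send the identical prompt to every agent and collect answers by name,
    # so the results can be compared side by side.
    return {name: agent(prompt) for name, agent in agents.items()}

# Stub agents standing in for real model clients (names are made up).
stub_agents = {
    "model_a": lambda p: f"model_a plan for: {p}",
    "model_b": lambda p: f"model_b plan for: {p}",
    "model_c": lambda p: f"model_c plan for: {p}",
}

answers = fan_out("How should we split the billing domain?", stub_agents)
for name, answer in answers.items():
    print(f"--- {name} ---\n{answer}")
```

The comparison step (reading the n outputs and refining the spec) stays manual here, which matches the planning-stage use both commenters describe.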