Comment by woah
11 hours ago
The current fad for "agent swarms" or "model teams" seems misguided, although it definitely makes for great paper fodder (especially if you combine it with distributed systems!) and gets the VCs hot.
An LLM running one query at a time can already generate a huge amount of text in a few hours, and drain your bank account too.
A "different agent" is just different context supplied in the query to the LLM. There is nothing more than that. Maybe some of them use a different model, but again, this is just a setting in OpenRouter or whatever.
Agent parallelism just doesn't seem necessary and makes everything harder. Not an expert though, tell me where I'm wrong.
An agent is a way of performing an action that will generate context or a useful side effect without having to worry about the intermediate context.
People already do this serially by having a model write a plan, clearing the context, then having the same or a cheaper model action the plan. Doing so discards the intermediate context.
Sub-agents just let you do this in parallel. This works best when you have a task that needs to be done multiple times that cannot be done deterministically. For example, applying the same helper class usage in multiple places across a codebase, finding something out about multiple parts of the codebase, or testing a hypothesis in multiple places across a codebase.
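A minimal sketch of that fan-out pattern (`run_agent` here is just a stand-in for a real LLM call, not any particular framework's API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str) -> str:
    """Stand-in for one sub-agent: in a real harness this would be a
    fresh-context LLM session that returns only its final summary."""
    return f"summary of: {task}"

def fan_out(tasks: list[str], max_workers: int = 4) -> list[str]:
    # Each task gets its own isolated "context"; only the short results
    # come back to the parent, so the intermediate context is discarded.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_agent, tasks))

results = fan_out([
    "apply HelperClass in module_a.py",
    "apply HelperClass in module_b.py",
    "apply HelperClass in module_c.py",
])
```

The parent only ever sees the three summaries, not the three full transcripts.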
> Agent parallelism just doesn't seem necessary and makes everything harder. Not an expert though, tell me where I'm wrong.
I use parallel agents for speed or when my single agent process loses focus due to too much context. I determine context problems by looking at the traces for complaints like "this is too complicated so I'll just do the first part" or "there are too many problems, I'll display the top 5".
If you're trying a "model swarm" to improve reliability beyond 95% or so, you need to start hoisting logic into Python scripts.
Would it be possible to give that single agent process twice as much compute, instead? Or do production systems not scale that way?
It's hard to speed up running a single prompt through a model because decoding is a sequential, memory-bandwidth-limited process: you roughly need to stream all of the model's weights through the GPU to produce each next token before starting again, and the GPU can do far more arithmetic than one weight application per memory fetch. So with current hardware it's a lot more efficient to run multiple prompts in parallel against the same weights.
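The back-of-envelope version of that bandwidth argument, with made-up but representative numbers (the bandwidth and model size below are assumptions, not measurements of any real hardware):

```python
# Rough ceiling on single-stream decode speed: each new token requires
# streaming (roughly) every weight through the GPU once, so
#   tokens/sec <= memory_bandwidth / model_size_in_bytes.
def max_tokens_per_sec(params_billions: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

# e.g. a 70B-parameter model at 2 bytes/param on ~3 TB/s of bandwidth:
rate = max_tokens_per_sec(70, 2.0, 3000)  # ~21 tokens/sec ceiling
```

Serving a batch of prompts reuses each weight fetch across every prompt in the batch, which is why providers batch aggressively.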
Also, the limiting factor on a single agent instance is generally how much of its context window gets filled, not how much time it has to 'think'. Model performance generally degrades as the context grows (it more or less gets dumber the more it has to keep in mind), so agent frameworks mitigate this by summarizing the work of one instance and passing the summary into another instance with a fresh context. This means that if you have five tasks that will each fill up a model's addressable context, there's no real benefit to running them sequentially unless they naturally feed into each other.
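The summarize-and-hand-off loop, sketched with stand-in functions (`run_stage` and `summarize` would both be LLM calls in a real harness; the names are illustrative):

```python
def run_stage(prompt: str) -> str:
    """Stand-in for one agent run that fills its context window."""
    return f"[long transcript for: {prompt}]"

def summarize(transcript: str, limit: int = 200) -> str:
    """Stand-in for asking a model to compress its own work."""
    return transcript[:limit]

def staged_run(stages: list[str]) -> str:
    carry = ""
    for stage in stages:
        # Each stage starts from a fresh context: only the summary of
        # the previous stage is carried forward, never the full transcript.
        transcript = run_stage(f"{carry}{stage}")
        carry = summarize(transcript) + " "
    return carry.strip()

final = staged_run(["explore the codebase", "write the fix", "add tests"])
```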
You can speed up a single request by running it at a smaller batch size, but that scales non-linearly in practice (you're paying for compute that sits idle). It probably isn't worth it unless your model provider makes it really easy.
Where we've had some success is with heterogeneous agents with some cheap quantised/local models performing certain tasks extremely cheaply that are then overseen or managed by a more expensive model.
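Roughly the shape of that arrangement, as a sketch with stand-in functions rather than real model calls (the function names and retry count are illustrative assumptions):

```python
def cheap_model(task: str) -> str:
    # Stand-in for a small quantised/local model doing the bulk work.
    return f"draft for {task}"

def expensive_model_approves(draft: str) -> bool:
    # Stand-in for the premium model, used only to review short drafts,
    # so its (costly) token usage stays small.
    return draft.startswith("draft for")

def do_task(task: str, retries: int = 2) -> str:
    for _ in range(retries + 1):
        draft = cheap_model(task)
        if expensive_model_approves(draft):
            return draft
    raise RuntimeError("cheap model kept failing review; escalate")
```

The expensive model reads only short drafts instead of doing the work itself, which is where the cost saving comes from.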
I've played with this type of thing and I couldn't justify it versus just using a premium model, which seems more direct and less error-prone. In my experience the cheap models could really consume tokens and run up cost anyway.
Steelmanning the other side of this question:
LLMs mostly do useful work by writing stories about AI assistants who issue various commands and reply to a user's prompts. These do work, but they are fundamentally like a screenplay that the LLM is continuing.
An "agent" is a great abstraction since the LLM is used to continuing stories about characters going through narrative arcs. The type of work that would be assigned to a particular agent can also keep its context clean and distraction-free.
So parallelism could be useful even if everything is completely sequential to study how these separate characters and narrative arcs intersect in ways that are similar to real characters acting independently and simultaneously, which is what LLMs are good at writing about.
Seems like the important thing would be to avoid getting caught up in actual "wall time" parallelism.
I also really appreciate the point about using LLM teams for fault-tolerance protocols in the future (in addition to improving efficiency). Since agents tend to hallucinate and fail unpredictably, coordinating several of them to verify each other's output and reach a consensus could reduce those errors.
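A simple version of that consensus step, assuming the answers can be compared for exact equality (real outputs would usually need normalization first):

```python
from collections import Counter

def consensus(answers: list[str]) -> str:
    # With roughly independent failures, several unreliable agents
    # agreeing is less likely to be wrong than any single answer.
    answer, votes = Counter(answers).most_common(1)[0]
    if votes <= len(answers) // 2:
        raise ValueError("no majority; escalate or retry")
    return answer
```

`consensus(["42", "42", "17"])` returns `"42"`, while three mutually disagreeing answers raise instead of silently picking one.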
> A "different agent" is just different context supplied in the query to the LLM. There is nothing more than that.
Yup, but context includes the prompt, which can strongly steer LLM behavior. Sometimes the harness restricts certain operations to help the LLM stay in its lane. And starting with a fresh context and a clear description of the thing it should work on is great.
People get angry when their 200k or million-token context gets filled. I can't ever understand why they expect it to work: keeping that much info in operational memory just can't go well, for any mind. Divide and conquer, don't pile up all the crap till it overflows.
you have to own the inference layer
I tend to agree. After seeing http://chatjimmy.ai, I think multi-agent systems are mostly just solving for LLMs being slow currently.
This is like saying “multi-core cpus are just solving cpus being slow”. Which yes, exactly.