They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.
It would be more interesting to know how many simultaneous users this setup can serve. Otherwise I can just say it serves 500 users, but not all of them at the same time, which doesn't communicate the right level of detail.
Depends on the TTFT (time to first token) and tokens per second you want.
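To put rough numbers on "simultaneous" versus "total" users, here is a minimal sketch. It assumes each user independently has a request in flight for some fraction of the time (a simple binomial model, which is my assumption and not something measured in this thread). The function names and the `p_active` parameter are hypothetical.

```python
from math import comb


def expected_active(users: int, p_active: float) -> float:
    """Average number of users with a request in flight at any moment."""
    return users * p_active


def prob_more_than(slots: int, users: int, p_active: float) -> float:
    """Probability that more than `slots` users are active at the same time.

    Assumes users are independent and each is active with probability
    `p_active` (binomial model, an assumption for illustration only).
    """
    p_at_most = sum(
        comb(users, k) * p_active**k * (1 - p_active) ** (users - k)
        for k in range(slots + 1)
    )
    return 1.0 - p_at_most


if __name__ == "__main__":
    # 80 users and 16 slots are figures mentioned in the thread;
    # the active fractions are guesses.
    for p in (0.05, 0.10, 0.25):
        print(p, expected_active(80, p), prob_more_than(16, 80, p))
```

The point is only that "serves N users" needs an assumed active fraction to mean anything. Agentic sessions that spawn subagents would push that fraction well above what a chat UI sees.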
I also call this "bollocks": there is no way this workflow is even 1/10 of what you can get with Codex/Claude Code.
A normal engineer may be running a couple of sessions, with every session spawning subagents left and right.
80 people, or even 10, with this workflow on this setup doesn't work, and this is the standard engineer workflow today.
Subagent swarms are actually great for the local inference scenario because they can share a whole lot of KV cache. You get to raise the compute intensity of decode (i.e. the aggregate tok/s) essentially for free.
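A rough way to see the sharing effect, as a sketch only: count how many tokens of KV cache have to be held when subagents reuse a common prefix versus when each keeps its own copy. The function name and the example token counts are hypothetical, and this assumes the inference server actually supports prefix sharing across sequences.

```python
def cached_tokens(
    agents: int,
    shared_prefix: int,
    unique_per_agent: int,
    share_prefix: bool = True,
) -> int:
    """Total tokens of KV cache held for a group of subagents.

    With sharing, the common prefix is stored once; without it,
    every agent stores its own copy of the prefix.
    """
    if share_prefix:
        return shared_prefix + agents * unique_per_agent
    return agents * (shared_prefix + unique_per_agent)


if __name__ == "__main__":
    # Hypothetical: 16 subagents, 20k-token shared context, 2k unique each.
    print(cached_tokens(16, 20_000, 2_000))
    print(cached_tokens(16, 20_000, 2_000, share_prefix=False))
```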
With a parallelism of 16 you can still get around 25 to 30 tokens per second per user when all 16 channels are running. Not everyone will use the model at the same time, but it will certainly be tight, especially for agentic coding. For pure chat applications this should be quite fine.
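Spelling out the arithmetic behind those figures, reading them as tokens per second per user: a small sketch with hypothetical helper names. The 1,000-token response length is just an example, not a number from the thread.

```python
def aggregate_tok_s(slots: int, per_user_tok_s: float) -> float:
    """Total decode throughput when every slot is busy."""
    return slots * per_user_tok_s


def seconds_for_response(response_tokens: int, per_user_tok_s: float) -> float:
    """Decode time for one response, ignoring time to first token."""
    return response_tokens / per_user_tok_s


if __name__ == "__main__":
    print(aggregate_tok_s(16, 25), aggregate_tok_s(16, 30))  # 400 and 480 tok/s
    print(seconds_for_response(1000, 25))  # 40.0 seconds
```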
The problem with wide parallelism on most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference, or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.
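For a sense of scale, here is a back-of-the-envelope sketch of KV cache size for a conventional attention layout: keys and values stored per layer, per KV head, per token. The model dimensions are hypothetical and not taken from any model discussed here. Architectures with compressed or otherwise leaner caches would come out much smaller.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    sequences: int,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16/bf16 cache entries
) -> int:
    """Approximate KV cache size for standard multi-head or grouped-query attention."""
    # Factor of 2: one key vector and one value vector per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * sequences


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 32k context.
    print(kv_cache_bytes(32, 8, 128, 32_768, 1) / GIB)   # one sequence
    print(kv_cache_bytes(32, 8, 128, 32_768, 16) / GIB)  # 16 parallel sequences
```

With these made-up dimensions a single full-context sequence needs 4 GiB and 16 of them need 64 GiB, which is why parallelism gets expensive unless the sequences share a prefix or the model has an unusually lean cache.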
Are they using it as an assistant, not running multiple fully automated agent loops?