Comment by r0b05

13 hours ago

How can a single 5090 serve 80 people? Something doesn't add up here.

8 comments

r0b05

They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.
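
A rough back-of-the-envelope sketch of this argument, using Little's law (average requests in flight = arrival rate × time per request). The request rate and response time below are hypothetical numbers chosen for illustration, not measurements from the setup being discussed.

```python
def expected_concurrency(users: int,
                         requests_per_user_per_hour: float,
                         seconds_per_request: float) -> float:
    """Average number of requests in flight at once (Little's law)."""
    # Total busy-seconds generated per hour, divided by seconds in an hour.
    return users * requests_per_user_per_hour * seconds_per_request / 3600.0


# Hypothetical: 80 users, each sending 12 requests per hour,
# each request keeping a slot busy for 30 seconds.
print(expected_concurrency(80, 12, 30))  # 8.0
```

Under those assumed numbers, 80 registered users amount to about 8 requests in flight on average. Bursts will go above the average, which is where the queueing described above comes in.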

r0b05 10 hours ago
It would be more interesting to know how many simultaneous users this setup can serve. Otherwise I could just say it serves 500 users, but not all of them at the same time, which doesn't communicate the right level of detail.
- p1esk 3 hours ago
  
  Depends on TTFT and tokens per second you want.

jvidalv 11 hours ago

I also call this "bollocks": there is no way this workflow is even 1/10 of what you can get with Codex/Claude Code.

A normal engineer may be running a couple of sessions with every session spawning sub agents left and right.

80 people, or even 10, running this workflow on this setup doesn't work, and this is the standard engineer workflow today.

zozbot234 10 hours ago

Subagent swarms are actually great for the local inference scenario because they can share a whole lot of KV cache. You get to raise the compute intensity of decode (i.e. the aggregate tok/s) essentially for free.
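
A minimal sketch of why a shared prefix helps, assuming the server can reuse cached KV entries for a prompt prefix that is identical across subagents (prefix caching). The token counts are hypothetical and `cached_tokens` is an illustrative helper, not part of any inference server's API.

```python
def cached_tokens(subagents: int,
                  shared_prefix_tokens: int,
                  unique_tokens_each: int,
                  share_prefix: bool) -> int:
    """Tokens that must be held in the KV cache across all subagents."""
    if share_prefix:
        # The common prefix is stored once; only each agent's own tail is extra.
        return shared_prefix_tokens + subagents * unique_tokens_each
    # Without sharing, every subagent keeps its own copy of the prefix.
    return subagents * (shared_prefix_tokens + unique_tokens_each)


# Hypothetical: 8 subagents, 20,000-token shared context, 2,000 unique tokens each.
print(cached_tokens(8, 20_000, 2_000, share_prefix=False))  # 176000
print(cached_tokens(8, 20_000, 2_000, share_prefix=True))   # 36000
```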

mixermachine 11 hours ago

With a parallelism of 16 you can still get around 25 to 30 tokens per second per user when all 16 channels are running. Not everyone will use the model at the same time, but it will certainly be tight, especially for agentic coding. For pure chat applications this should be quite fine.
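
The arithmetic behind those figures, reading them as tokens per second per user. The 1,000-token response length is a hypothetical example, not a number from the post.

```python
def aggregate_tokens_per_second(parallel_slots: int, per_user_tps: float) -> float:
    """Total decode throughput when every slot is busy."""
    return parallel_slots * per_user_tps


def seconds_for_response(response_tokens: int, per_user_tps: float) -> float:
    """Time one user waits for a response of the given length (decode only)."""
    return response_tokens / per_user_tps


print(aggregate_tokens_per_second(16, 25))  # 400
print(aggregate_tokens_per_second(16, 30))  # 480
print(seconds_for_response(1000, 25))       # 40.0
```

So the claim corresponds to roughly 400 to 480 tokens per second in aggregate, and a 1,000-token answer would take about 40 seconds at the low end, ignoring prompt processing time.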

zozbot234 10 hours ago

The problem with wide parallelism with most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference, or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.
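
To put a number on "blows up", here is a sketch of the usual KV cache size estimate for a standard attention model: two tensors (keys and values) per layer, per KV head, per token. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values, 32k context) is a hypothetical configuration, not the model from the post.

```python
def kv_cache_bytes(tokens: int,
                   layers: int,
                   kv_heads: int,
                   head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence of the given length."""
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 2 ** 30
per_sequence = kv_cache_bytes(tokens=32_768, layers=32, kv_heads=8, head_dim=128)
print(per_sequence / GIB)       # 4.0
print(16 * per_sequence / GIB)  # 64.0
```

With that assumed shape, one full 32k-token sequence takes 4 GiB of cache and 16 of them take 64 GiB, before counting the model weights. Models with fewer KV heads, a compressed KV representation, or a quantized cache shrink this proportionally.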

hacker_homie 12 hours ago

They are using it as an assistant, not running multiple fully automated agent loops?