Comment by kgeist
12 hours ago
They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.
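A rough way to put a number on this is Little's law (average requests in flight = arrival rate × time per request). The sketch below is only a back-of-the-envelope estimate; the function name and the example figures (requests per hour, seconds per response) are made-up assumptions, not measurements from this setup.

```python
def average_in_flight(users: int,
                      requests_per_user_per_hour: float,
                      seconds_per_request: float) -> float:
    """Little's law: L = lambda * W.

    All inputs are hypothetical workload assumptions, not measured values.
    """
    # Total busy-seconds requested per hour, spread over the 3600 seconds in an hour.
    return users * requests_per_user_per_hour * seconds_per_request / 3600.0


if __name__ == "__main__":
    # Assumed workload: 500 users, 6 requests each per hour, 30 s per response.
    print(average_in_flight(500, 6, 30))
```

With those assumed inputs, the average number of requests being served at once is far below the headline user count, which is why the two figures shouldn't be confused. Note this is an average; bursts will be higher.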
It would be more interesting to know how many simultaneous users this setup can serve. Otherwise, I could just say it serves 500 users, but not all of them at the same time, which doesn't communicate the right level of detail.
That depends on the TTFT (time to first token) and the tokens per second you want.
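To make that trade-off concrete, here is a deliberately crude toy model, assuming the server's decode throughput is split evenly across streams and that prompts are prefilled one after another. Real servers batch requests and don't scale this linearly, and every name and number here is a hypothetical placeholder.

```python
def max_concurrent_streams(decode_tokens_per_second: float,
                           min_tokens_per_second_per_user: float,
                           prefill_tokens_per_second: float,
                           prompt_tokens: int,
                           ttft_budget_seconds: float) -> int:
    """Toy estimate of simultaneous users under two targets.

    Assumes (unrealistically) an even split of decode throughput and
    strictly sequential prefill; all parameters are illustrative.
    """
    # Streams that can each still get the minimum generation speed.
    decode_bound = int(decode_tokens_per_second // min_tokens_per_second_per_user)
    # Prompts that can be prefilled back to back within the TTFT budget.
    prefill_bound = int((ttft_budget_seconds * prefill_tokens_per_second) // prompt_tokens)
    return min(decode_bound, prefill_bound)


if __name__ == "__main__":
    # Assumed figures only: 1000 tok/s decode, 20 tok/s per user,
    # 5000 tok/s prefill, 2000-token prompts, 2 s TTFT budget.
    print(max_concurrent_streams(1000, 20, 5000, 2000, 2.0))
```

Under these assumptions the tighter of the two limits wins, so loosening either the TTFT or the per-user tokens-per-second target changes the answer.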