Comment by jp57
7 days ago
700M weekly users doesn't say much about how much load they have.
I think the thing to remember is that the majority of ChatGPT users, even those who use it every day, are idle 99.9% of the time. Even someone who has it actively processing for an hour a day, seven days a week, is idle 96% of the time. On top of that, many are using less-intensive models. The fact that they chose to mention weekly users implies that there is a significant tail of their user distribution who don't even use it once a day.
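The 96% figure is just back-of-envelope arithmetic; here's the calculation spelled out (the one-hour-a-day heavy user is an assumption, not a measured number):

```python
# Utilization sketch: a hypothetical heavy user who keeps the model
# actively processing for 1 hour a day, every day of the week.
active_hours_per_day = 1
idle_fraction = 1 - active_hours_per_day / 24

print(f"{idle_fraction:.0%} idle")  # roughly 96% idle
```

Scaled across hundreds of millions of accounts, that idle fraction is what lets a comparatively small fleet serve the whole user base.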
So your question factors into a few easier-but-still-not-trivial problems:
- Making individual hosts that can fit their models in memory and run them at acceptable toks/sec.
- Making enough of them to handle the combined demand, as measured in peak aggregate toks/sec.
- Multiplexing all the requests onto the hosts efficiently.
Of course there are nuances, but honestly, from a high level the last problem does not seem so different from running a search engine. All the state is in the chat transcript, so I don't think there's any particular reason that successive interactions on the same chat need to be handled by the same server. They could just be load-balanced to whatever server is free.
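To make the "any free server" point concrete, here's a toy sketch of stateless routing (an assumed design for illustration, not how OpenAI actually does it): each request carries its full transcript, so a shared queue of free hosts is all the affinity logic you need.

```python
import queue

# Hypothetical pool of 4 interchangeable inference hosts. Because every
# request carries the whole chat transcript, no host holds per-chat state.
free_servers = queue.Queue()
for server_id in range(4):
    free_servers.put(server_id)

def handle(transcript: str) -> str:
    server = free_servers.get()      # blocks until some host is free
    try:
        # the host needs nothing but the transcript itself to respond
        return f"server {server} replies to {len(transcript)}-char chat"
    finally:
        free_servers.put(server)     # return the host to the pool

print(handle("user: hello"))
```

Successive turns of the same conversation can land on different hosts with no correctness impact; the queue also naturally models the "waiting for a free server" case mentioned below.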
We don't know, for example, when the chat says "Thinking..." whether the model is running or if it's just queued waiting for a free server.