Comment by rythie
7 days ago
First off, I’d say you can run models locally at good speed: llama3.1:8b runs fine on a MacBook Air M2 with 16GB RAM and much better on an Nvidia RTX 3050, which is fairly affordable.
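For concreteness, here's one way to hit such a local model from Python, assuming Ollama is serving it on its default port and exposes its documented /api/generate endpoint (the prompt is just an example):

    import requests

    # Ask the locally running llama3.1:8b model a question via Ollama's HTTP API.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])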
For OpenAI, I’d assume that a GPU is dedicated to your task from the point you press enter to the point it finishes writing. I would think most of the 700 million users barely use ChatGPT, while a small proportion use it a lot and likely have to pay because of the limits. Most of the time you have the website/app open, you’re either reading what it has written, writing something, or it’s just sitting open in the background, so ChatGPT isn’t doing anything in that time. If we assume 20 queries a week taking 25 seconds each, that’s 8.33 minutes a week. That would mean a single GPU could serve up to 1,209 users, so for 700 million users you’d need at least 578,703 GPUs. Sam Altman has said OpenAI is due to have over a million GPUs by the end of the year.
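Spelling the arithmetic out (these are the same assumed numbers as above, not measured utilisation data):

    # Back-of-envelope capacity estimate using the assumptions above.
    queries_per_week = 20
    seconds_per_query = 25
    busy_min_per_user = queries_per_week * seconds_per_query / 60   # ~8.33 minutes/week
    minutes_per_week = 7 * 24 * 60                                  # 10,080 minutes
    users_per_gpu = minutes_per_week / busy_min_per_user            # ~1,209 users per GPU
    gpus_needed = 700_000_000 / users_per_gpu                       # ~578,700 GPUs
    print(f"{users_per_gpu:,.0f} users/GPU -> {gpus_needed:,.0f} GPUs")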
I’ve found that inference speed on newer GPUs is barely faster than on older ones (perhaps it’s memory-bandwidth limited?). They could be using older clusters of V100, A100 or even H100 GPUs for inference if they can get the model to fit, or multiple GPUs if it doesn’t. A100s were available in 40GB and 80GB versions.
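A quick way to see why fitting the model matters: weight memory is roughly parameter count times bytes per parameter (a rule-of-thumb sketch that ignores KV cache and activation overhead; the model sizes below are just examples):

    # Rough weight-only memory footprint; real serving needs extra headroom for KV cache.
    def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
        """2 bytes/param for FP16/BF16, 1 for INT8, 0.5 for INT4."""
        return params_billions * bytes_per_param

    print(weight_memory_gb(8))    # ~16 GB -> fits a 40GB A100 with room to spare
    print(weight_memory_gb(70))   # ~140 GB -> needs multiple 80GB A100s/H100s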
I would think they use a queuing system to allocate your message to a GPU. Slurm is widely used in HPC compute clusters, so they might use that, though more likely they have rolled their own system for inference.
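In spirit it’s something like this (a toy sketch of the queuing idea, not OpenAI’s actual scheduler; the GPU count and the “inference” step are placeholders):

    import queue
    import threading

    requests_q: queue.Queue[str] = queue.Queue()

    def gpu_worker(gpu_id: int) -> None:
        # Each GPU pulls the next waiting message as soon as it is free.
        while True:
            prompt = requests_q.get()
            _ = f"[GPU {gpu_id}] generating for: {prompt}"   # stand-in for real inference
            requests_q.task_done()

    for gpu_id in range(4):                                  # pretend cluster of 4 GPUs
        threading.Thread(target=gpu_worker, args=(gpu_id,), daemon=True).start()

    for i in range(10):
        requests_q.put(f"user prompt {i}")
    requests_q.join()                                        # block until all are handled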
The idea that a GPU is dedicated to a single inference task is just generally incorrect. Inputs are batched, and it’s not a single GPU handling a single request; it’s a handful of GPUs in various parallelism schemes processing a batch of requests at once. There’s a latency-vs-throughput trade-off that operators make: the larger the batch size, the greater the latency, but it improves overall cluster throughput.
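A toy calculation shows that trade-off (purely illustrative numbers, not any real server): waiting to fill a larger batch adds queuing delay to each request but slashes the number of forward passes the cluster has to run.

    # Larger batches: fewer forward passes (throughput up) but more waiting (latency up).
    def batching_tradeoff(num_requests: int, arrivals_per_sec: float, batch_size: int):
        forward_passes = -(-num_requests // batch_size)               # ceiling division
        avg_extra_wait_s = (batch_size - 1) / (2 * arrivals_per_sec)  # mean wait to fill a batch
        return forward_passes, avg_extra_wait_s

    for bs in (1, 8, 64):
        passes, wait = batching_tradeoff(1000, arrivals_per_sec=50, batch_size=bs)
        print(f"batch={bs:3d}  forward_passes={passes:5d}  avg_extra_wait={wait:.2f}s")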