Comment by anon291
7 days ago
At the end of the day, the answer is... specialized hardware. No matter what you do on your local system, you don't have the interconnects necessary. Yes, they have specialized software, but that software wouldn't work locally anyway. NVIDIA sells entire solutions and specialized interconnects for exactly this purpose, and they are well out of reach of the average consumer.
But software-wise, they shard, load-balance, and batch. ChatGPT handles thousands of requests per second (or something in that range). Those requests are batched together and submitted to one GPU, and generating text for 1,000 answers often takes about the same time as generating for just 1, because inference is memory-bound: most of the time goes to streaming model weights from memory, and that cost is paid once per batch no matter how many requests share it.
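A rough sketch of that last point (not from the comment), assuming PyTorch and a toy 4096-wide weight matrix standing in for one layer; the effect is even more pronounced on real GPUs, where compute is fast relative to memory bandwidth:

```python
# Toy sketch: the weight matrix is read from memory once per forward pass
# whether it serves 1 request or 64, so the batched call is usually
# nowhere near 64x slower.
import time
import torch

hidden = 4096                          # hypothetical layer width
weight = torch.randn(hidden, hidden)   # stands in for one layer's weights

def time_forward(batch_size: int, iters: int = 50) -> float:
    x = torch.randn(batch_size, hidden)
    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ weight                 # same weight reads, more useful work per read
    return time.perf_counter() - start

print("batch=1 :", time_forward(1))
print("batch=64:", time_forward(64))
```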