
Comment by Szpadel

7 days ago

AFAIK the main trick is batching: a GPU can do the same work on a batch of data, so you can serve many requests at the same time much more efficiently.
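
A minimal numpy sketch of the idea (all sizes made up, this isn't anyone's real serving code): when a decoding step is memory-bandwidth bound, one read of the weights can serve a whole batch.

```python
import numpy as np

d_model, batch = 4096, 64
W = np.random.randn(d_model, d_model).astype(np.float32)  # one layer's weights

# One request: the whole weight matrix is read to produce one token's activations.
x_single = np.random.randn(1, d_model).astype(np.float32)
y_single = x_single @ W

# 64 batched requests: the same weight read now serves 64 tokens at once.
x_batch = np.random.randn(batch, d_model).astype(np.float32)
y_batch = x_batch @ W  # ~same memory traffic for W, 64x the useful output

# Decoding is usually dominated by reading W from memory, not by the extra
# FLOPs for the batch, so per-token cost drops almost linearly with batch size.
print(y_single.shape, y_batch.shape)  # (1, 4096) (64, 4096)
```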

Batching requests increases latency to first token, so it's a tradeoff, and MoE makes it trickier because the experts are not used equally.
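
A hypothetical illustration of the MoE complication (random logits stand in for a real router): top-1 routing over a batch rarely spreads tokens evenly, so the busiest expert sets the pace for the whole step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, batch = 8, 64
gate_logits = rng.normal(size=(batch, n_experts))  # stand-in for the learned router
chosen = gate_logits.argmax(axis=1)                # top-1 expert per token

load = np.bincount(chosen, minlength=n_experts)
print(load)  # uneven counts; the step waits for the most loaded expert
```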

There was a great article somewhere explaining DeepSeek's efficiency in detail (basically the latency/throughput tradeoff).

Your model keeps the weights in slow memory and needs to touch all of them to produce one token for you. By batching, you produce 64 tokens for 64 users in one go. And they use dozens of GPUs in parallel to produce 1024 tokens in the time your system produces one. So even though the big system costs more, it is much more efficient when used by many users in parallel. Also, by using many fast GPUs in series to process parts of the neural net, it produces output much faster for each user than your local system can. You can't beat that.
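
Back-of-envelope numbers for the "touch all weights per token" argument. Everything below is an illustrative assumption (model size, bandwidth figures), not a measured spec, and in practice the weights are sharded across many GPUs, which also multiplies the aggregate bandwidth:

```python
# All numbers are rough assumptions for illustration only.
weight_bytes = 70e9 * 2          # a 70B-parameter model at fp16 (~140 GB)
local_bw     = 50e9              # ~50 GB/s: consumer CPU memory bandwidth
gpu_bw       = 2000e9            # ~2 TB/s: one datacenter-class GPU's HBM

# If decoding is bandwidth-bound, a single user gets about bandwidth / weights:
print(local_bw / weight_bytes)   # ~0.36 tokens/s locally
print(gpu_bw / weight_bytes)     # ~14 tokens/s on the fast GPU

# Batching: the same weight pass serves the whole batch, so aggregate
# throughput scales with batch size (until the GPU becomes compute-bound):
batch = 64
print(batch * gpu_bw / weight_bytes)  # ~915 tokens/s across 64 users
```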