Comment by DavidSJ

2 months ago

Here’s a concise explanation:

- High sparsity means you need a very large batch size (the number of requests being processed concurrently) so that each matrix multiplication has high enough arithmetic intensity (FLOPs per byte moved from HBM) to get good hardware utilization.

- At such a large batch size, you’ll need a decent number of GPUs (8-16 or so, depending on the type) just to fit the weights and the MLA KV cache in HBM. But with only 8-16 GPUs, your aggregate throughput is going to be so low that each of the many individual user requests will be served unacceptably slowly for most applications. Thus you need more like 256 GPUs for a good user experience (rough numbers sketched below).
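
A back-of-the-envelope sketch of both points, using approximate public figures for DeepSeek-V3/R1 (671B total parameters, FP8 weights) and the H100 SXM (80 GB of HBM3 at ~3.35 TB/s, ~2 PFLOPS dense FP8). These numbers are assumptions for illustration, not from the comment:

```python
# Rough memory and roofline arithmetic for serving DeepSeek-V3/R1.
# All numbers are approximate and for illustration only.

GB = 1e9

# --- Does the model fit? ---
total_params = 671e9             # DeepSeek-V3/R1 total parameter count
weight_bytes = total_params * 1  # FP8: 1 byte per weight -> ~671 GB

hbm_per_gpu = 80 * GB            # H100, 80 GB HBM3 (an H200 would be 141 GB)
for n_gpus in (8, 16):
    headroom = n_gpus * hbm_per_gpu - weight_bytes
    print(f"{n_gpus} GPUs: {headroom / GB:+.0f} GB left for MLA KV cache + activations")
# 8x H100:  about -31 GB -> doesn't quite fit (8x H200 would)
# 16x H100: about +609 GB -> fits, with room for the KV cache

# --- How large must each matmul's batch be? ---
# Multiplying M tokens by a (K, N) weight matrix does ~2*M*K*N FLOPs while
# reading ~K*N bytes of FP8 weights, i.e. ~2*M FLOPs per weight byte.
peak_fp8_flops = 1979e12   # H100 SXM dense FP8, FLOP/s
hbm_bandwidth = 3.35e12    # H100 SXM HBM3, bytes/s
ridge = peak_fp8_flops / hbm_bandwidth  # ~590 FLOP/byte
print(f"Need M of roughly {ridge / 2:.0f} tokens per matmul to be compute-bound")
```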

I’m serving it on 16 H100s (2 nodes). I get 50-80 tok/s per request, and in aggregate I’ve seen several thousand tok/s. TTFT is pretty stable. It’s faster than any cloud service we can use.

  • H200s are pretty easy to get now. If you switched, I'm guessing you'd get a nice bump because the NCCL all-reduce on the big MLPs wouldn't have to cross InfiniBand.

  • You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.

    • Yes, very tiny batch size on average. I have not optimized for MFU. This is optimized for a varying number (~1-60ish) of active requests while minimizing latency (time to first token, and time to last token measured from the final prompt token), given short-to-medium known "prompts" and short structured responses, with very little in the way of shared prefixes in concurrent prompts.
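
For scale, here is a rough MFU estimate from the throughput figures above, assuming DeepSeek-V3/R1's ~37B active parameters per token and a dense BF16 peak of ~990 TFLOPS per H100 (assumed reference figures, not from the thread; using the FP8 peak would roughly halve the percentages):

```python
# Rough model FLOP utilization (MFU) estimate for 16 H100s.
# Assumptions: ~37e9 active parameters per generated token (DeepSeek-V3/R1),
# ~2 FLOPs per active parameter per token, ~990e12 dense BF16 FLOP/s per H100.

active_params = 37e9
flops_per_token = 2 * active_params

peak_per_gpu = 990e12
n_gpus = 16

for agg_tok_per_s in (100, 1000, 3000):  # aggregate decode tokens/s
    achieved_flops = agg_tok_per_s * flops_per_token
    mfu = achieved_flops / (n_gpus * peak_per_gpu)
    print(f"{agg_tok_per_s:>5} tok/s aggregate -> MFU ~ {mfu:.2%}")
# Even at several thousand aggregate tok/s, MFU is on the order of 1%,
# which is the "very low MFU, high dollar cost per token" regime described above.
```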

> High sparsity means you need a very large batch size

I don't understand what connection you're positing here? Do you think sparse matmul is actually a matmul with zeros lol

  • It's sparse in the sense that only a small fraction of tokens are multiplied by a given expert's weight matrices (this is standard terminology in the MoE literature). So to properly utilize the tensor cores (and hence serve DeepSeek cheaply, as the OP asks about), you need to serve enough tokens concurrently that the per-matmul batch dimension is large (see the sketch at the end of this thread).

    • I still don't understand what you're saying; you're just repeating that a sparse matmul is a sparse matmul ("only a small fraction of tokens are multiplied by a given expert's weight matrices"). So I'm asking you: do you believe that a sparse matmul has low/bad arithmetic intensity?

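To make the point of contention concrete, here is a minimal sketch of why MoE "sparsity" shrinks the batch dimension that each expert's GEMM actually sees, assuming DeepSeek-V3-style routing (256 routed experts, 8 active per token; assumed figures, not from the thread):

```python
# Why MoE "sparsity" lowers arithmetic intensity at small batch sizes.
# Assumed routing config (DeepSeek-V3-style): 256 routed experts, top-8 per token.

n_experts = 256
top_k = 8

def tokens_per_expert(batch_tokens: int) -> float:
    """Expected M dimension of one expert's GEMM under roughly uniform routing."""
    return batch_tokens * top_k / n_experts

# Each expert's GEMM does ~2*M FLOPs per byte of FP8 weights it reads, so M
# needs to be in the hundreds to be compute-bound (see the roofline sketch above).
for batch_tokens in (64, 1024, 16384):
    m = tokens_per_expert(batch_tokens)
    print(f"{batch_tokens:>6} tokens in flight -> ~{m:.0f} tokens per expert GEMM")

# A dense model sees M = batch_tokens in every matmul; here each expert sees
# only batch_tokens / 32, so a much larger concurrent batch is needed to reach
# the same arithmetic intensity. That is the sense in which "high sparsity"
# demands a very large batch size.
```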