Comment by DavidSJ

2 months ago

Here’s a concise explanation:

- High sparsity means you need a very large batch size (the number of requests being processed concurrently) so that each matrix multiplication has high enough arithmetic intensity (FLOPs per byte moved from HBM) to get good hardware utilization.

- At such a large batch size, you’ll need a decent number of GPUs (8-16 or so, depending on the type) just to fit the weights and the MLA KV cache in HBM. But with only 8-16 GPUs, your aggregate throughput is going to be so low that each of the many individual user requests will be served unacceptably slowly for most applications. Thus you need more like 256 GPUs for a good user experience (rough numbers sketched below).
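
A back-of-the-envelope sketch of both points, using approximate public figures for DeepSeek-V3/R1 (671B total parameters, FP8 weights) and the H100 SXM (80 GB of HBM3 at ~3.35 TB/s, ~2 PFLOPS dense FP8). These numbers are assumptions for illustration, not from the comment:

```python
# Rough memory and roofline arithmetic for serving DeepSeek-V3/R1.
# All numbers are approximate and for illustration only.

GB = 1e9

# --- Does the model fit? ---
total_params = 671e9             # DeepSeek-V3/R1 total parameter count
weight_bytes = total_params * 1  # FP8: 1 byte per weight -> ~671 GB

hbm_per_gpu = 80 * GB            # H100, 80 GB HBM3 (an H200 would be 141 GB)
for n_gpus in (8, 16):
    headroom = n_gpus * hbm_per_gpu - weight_bytes
    print(f"{n_gpus} GPUs: {headroom / GB:+.0f} GB left for MLA KV cache + activations")
# 8x H100:  about -31 GB -> doesn't quite fit (8x H200 would)
# 16x H100: about +609 GB -> fits, with room for the KV cache

# --- How large must each matmul's batch be? ---
# Multiplying M tokens by a (K, N) weight matrix does ~2*M*K*N FLOPs while
# reading ~K*N bytes of FP8 weights, i.e. ~2*M FLOPs per weight byte.
peak_fp8_flops = 1979e12   # H100 SXM dense FP8, FLOP/s
hbm_bandwidth = 3.35e12    # H100 SXM HBM3, bytes/s
ridge = peak_fp8_flops / hbm_bandwidth  # ~590 FLOP/byte
print(f"Need M of roughly {ridge / 2:.0f} tokens per matmul to be compute-bound")
```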

I’m serving it on 16 H100s (2 nodes). I get 50-80 tok/s per request, and in aggregate I’ve seen several thousand tok/s. TTFT is pretty stable. It’s faster than any cloud service we can use.

  • H200s are pretty easy to get now. If you switched, I'm guessing you'd get a nice bump because the NCCL all-reduce on the big MLPs wouldn't have to cross InfiniBand.

  • You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.

    • Yes, very tiny batch size on average. I have not optimized for MFU. This is optimized for a varying number (~1-60ish) of active requests while minimizing latency (time to first token, and time to last token measured from the final prompt token), given short-to-medium known "prompts" and short structured responses, with very little in the way of shared prefixes in concurrent prompts.
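
For scale, here is a rough MFU estimate from the throughput figures above, assuming DeepSeek-V3/R1's ~37B active parameters per token and a dense BF16 peak of ~990 TFLOPS per H100 (assumed reference figures, not from the thread; using the FP8 peak would roughly halve the percentages):

```python
# Rough model FLOP utilization (MFU) estimate for 16 H100s.
# Assumptions: ~37e9 active parameters per generated token (DeepSeek-V3/R1),
# ~2 FLOPs per active parameter per token, ~990e12 dense BF16 FLOP/s per H100.

active_params = 37e9
flops_per_token = 2 * active_params

peak_per_gpu = 990e12
n_gpus = 16

for agg_tok_per_s in (100, 1000, 3000):  # aggregate decode tokens/s
    achieved_flops = agg_tok_per_s * flops_per_token
    mfu = achieved_flops / (n_gpus * peak_per_gpu)
    print(f"{agg_tok_per_s:>5} tok/s aggregate -> MFU ~ {mfu:.2%}")
# Even at several thousand aggregate tok/s, MFU is on the order of 1%,
# which is the "very low MFU, high dollar cost per token" regime described above.
```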

> High sparsity means you need a very large batch size

I don't understand what connection you're positing here? Do you think sparse matmul is actually a matmul with zeros lol

  • It's sparse in the sense that only a small fraction of tokens are multiplied by a given expert's weight matrices (this is standard terminology in the MoE literature). So to properly utilize the tensor cores (and hence serve DeepSeek cheaply, as the OP asks about), you need to serve enough tokens concurrently that the per-matmul batch dimension is large (see the sketch at the end of this thread).

    • I still don't understand what you're saying; you're just repeating that a sparse matmul is a sparse matmul ("only a small fraction of tokens are multiplied by a given expert's weight matrices"). So I'm asking you: do you believe that a sparse matmul has low/bad arithmetic intensity?

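To make the point of contention concrete, here is a minimal sketch of why MoE "sparsity" shrinks the batch dimension that each expert's GEMM actually sees, assuming DeepSeek-V3-style routing (256 routed experts, 8 active per token; assumed figures, not from the thread):

```python
# Why MoE "sparsity" lowers arithmetic intensity at small batch sizes.
# Assumed routing config (DeepSeek-V3-style): 256 routed experts, top-8 per token.

n_experts = 256
top_k = 8

def tokens_per_expert(batch_tokens: int) -> float:
    """Expected M dimension of one expert's GEMM under roughly uniform routing."""
    return batch_tokens * top_k / n_experts

# Each expert's GEMM does ~2*M FLOPs per byte of FP8 weights it reads, so M
# needs to be in the hundreds to be compute-bound (see the roofline sketch above).
for batch_tokens in (64, 1024, 16384):
    m = tokens_per_expert(batch_tokens)
    print(f"{batch_tokens:>6} tokens in flight -> ~{m:.0f} tokens per expert GEMM")

# A dense model sees M = batch_tokens in every matmul; here each expert sees
# only batch_tokens / 32, so a much larger concurrent batch is needed to reach
# the same arithmetic intensity. That is the sense in which "high sparsity"
# demands a very large batch size.
```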