Comment by almostgotcaught
2 months ago
> High sparsity means you need a very large batch size
I don't understand what connection you're positing here? Do you think sparse matmul is actually a matmul with zeros lol
It's sparse as in only a small fraction of tokens are multiplied by a given expert's weight matrices (this is standard terminology in the MoE literature). So to properly utilize the tensor cores (hence serve DeepSeek cheaply, as the OP asks about) you need to serve enough tokens concurrently such that the per-matmul batch dimension is large.
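To put rough numbers on it (assuming illustrative DeepSeek-V3-like routing of 256 routed experts with 8 active per token; treat these as ballpark figures, not exact specs):

    # Per-expert batch size in an MoE: each token only hits a few experts,
    # so an average expert sees only a small slice of the concurrent tokens.
    n_experts = 256        # routed experts (illustrative)
    active_per_token = 8   # experts activated per token (illustrative)

    def per_expert_batch(concurrent_tokens):
        return concurrent_tokens * active_per_token / n_experts

    print(per_expert_batch(1024))   # ~32 tokens per expert matmul
    print(per_expert_batch(8192))   # ~256 tokens per expert matmul

So a concurrent batch that would keep a dense model's matmuls busy leaves each expert with a tiny batch dimension unless you serve roughly 32x more tokens at once.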
i still don't understand what you're saying - you're just repeating that a sparse matmul is a sparse matmul ("only a small fraction of tokens are multiplied by a given expert's weight matrices"). and so i'm asking you - do you believe that a sparse matmul has low/bad arithmetic intensity?
An MoE's matmuls have the same arithmetic intensity as a dense model's matmuls, provided they're being multiplied by a batch of activation vectors of equal size. The catch is that routing splits the batch: each expert only sees the fraction of tokens routed to it, so its effective batch dimension is much smaller than the total number of concurrent tokens. That's why high sparsity means you need a very large overall batch size to keep the per-expert matmuls compute-bound.
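Concretely, a back-of-envelope sketch (dimensions are made up for illustration, and I'm assuming bf16 weights and activations):

    # Arithmetic intensity (FLOPs per byte moved) of one (batch x d) @ (d x d_ff)
    # matmul. The formula is identical for a dense FFN and for a single expert;
    # the only thing routing changes is the effective batch each expert sees.
    def arithmetic_intensity(batch, d, d_ff, bytes_per_elem=2):
        flops = 2 * batch * d * d_ff
        bytes_moved = bytes_per_elem * (batch * d + d * d_ff + batch * d_ff)
        return flops / bytes_moved

    d, d_ff = 7168, 2048                        # illustrative per-expert dims
    print(arithmetic_intensity(32, d, d_ff))    # ~31 FLOPs/byte: memory-bound
    print(arithmetic_intensity(256, d, d_ff))   # ~220 FLOPs/byte: far better utilization

Same formula, same intensity at the same batch; the whole question is whether each expert actually gets a large enough batch, which is what the quoted claim about batch size is getting at.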