Comment by almostgotcaught
2 months ago
> High sparsity means you need a very large batch size
I don't understand what connection you're positing here? Do you think sparse matmul is actually a matmul with zeros lol
It's sparse as in only a small fraction of tokens are multiplied by a given expert's weight matrices (this is standard terminology in the MoE literature). So to properly utilize the tensor cores (hence serve DeepSeek cheaply, as the OP asks about) you need to serve enough tokens concurrently such that the per-matmul batch dimension is large.
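To put rough numbers on it (assuming illustrative DeepSeek-V3-like routing of 256 routed experts with 8 active per token; treat these as ballpark figures, not exact specs):

    # Per-expert batch size in an MoE: each token only hits a few experts,
    # so an average expert sees only a small slice of the concurrent tokens.
    n_experts = 256        # routed experts (illustrative)
    active_per_token = 8   # experts activated per token (illustrative)

    def per_expert_batch(concurrent_tokens):
        return concurrent_tokens * active_per_token / n_experts

    print(per_expert_batch(1024))   # ~32 tokens per expert matmul
    print(per_expert_batch(8192))   # ~256 tokens per expert matmul

So a concurrent batch that would keep a dense model's matmuls busy leaves each expert with a tiny batch dimension unless you serve roughly 32x more tokens at once.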
i still don't understand what you're saying - you're just repeating that a sparse matmul is a sparse matmul ("only a small fraction of tokens are multiplied by a given expert's weight matrices"). and so i'm asking you - do you believe that a sparse matmul has low/bad arithmetic intensity?
An MoE's matmuls have the same arithmetic intensity as a dense model's matmuls, provided they're being multiplied by a batch of activation vectors of equal size. The catch is that routing splits the batch: each expert only sees the fraction of tokens routed to it, so its effective batch dimension is much smaller than the total number of concurrent tokens. That's why high sparsity means you need a very large overall batch size to keep the per-expert matmuls compute-bound.
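Concretely, a back-of-envelope sketch (dimensions are made up for illustration, and I'm assuming bf16 weights and activations):

    # Arithmetic intensity (FLOPs per byte moved) of one (batch x d) @ (d x d_ff)
    # matmul. The formula is identical for a dense FFN and for a single expert;
    # the only thing routing changes is the effective batch each expert sees.
    def arithmetic_intensity(batch, d, d_ff, bytes_per_elem=2):
        flops = 2 * batch * d * d_ff
        bytes_moved = bytes_per_elem * (batch * d + d * d_ff + batch * d_ff)
        return flops / bytes_moved

    d, d_ff = 7168, 2048                        # illustrative per-expert dims
    print(arithmetic_intensity(32, d, d_ff))    # ~31 FLOPs/byte: memory-bound
    print(arithmetic_intensity(256, d, d_ff))   # ~220 FLOPs/byte: far better utilization

Same formula, same intensity at the same batch; the whole question is whether each expert actually gets a large enough batch, which is what the quoted claim about batch size is getting at.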