
Comment by almostgotcaught

2 months ago

I still don't understand what you're saying - you're just repeating that a sparse matmul is a sparse matmul ("only a small fraction of tokens are multiplied by a given expert's weight matrices"). So I'm asking you: do you believe that a sparse matmul has low/bad arithmetic intensity?

An MoE's matmuls have the same arithmetic intensity as a dense model's matmuls, provided each expert's weights are multiplied by a batch of activation vectors of the same size.
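
For concreteness, here's a back-of-the-envelope sketch (assumed FFN dimensions, fp16 storage, top-1 routing over 8 experts - none of these numbers come from the thread) showing that per-matmul arithmetic intensity is set entirely by the batch of activation vectors a given weight matrix actually sees:

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for a [batch, d_in] @ [d_in, d_out] matmul."""
    flops = 2 * batch * d_in * d_out                      # multiply-adds
    bytes_moved = bytes_per_elem * (d_in * d_out          # weight matrix
                                    + batch * d_in        # input activations
                                    + batch * d_out)      # output activations
    return flops / bytes_moved

d_in, d_out = 4096, 14336   # assumed FFN dimensions, purely illustrative

# Dense layer: every token in the batch hits the same weights.
dense = arithmetic_intensity(batch=1024, d_in=d_in, d_out=d_out)

# MoE expert: with 8 experts and top-1 routing, each expert sees only
# ~1/8 of the tokens -- a smaller effective batch, lower intensity.
expert_small_batch = arithmetic_intensity(batch=1024 // 8, d_in=d_in, d_out=d_out)

# But if the expert is fed a batch of activation vectors of the *same size*
# (e.g. the global batch is 8x larger), intensity matches the dense case.
expert_equal_batch = arithmetic_intensity(batch=1024, d_in=d_in, d_out=d_out)

print(f"dense:               {dense:.1f} FLOPs/byte")
print(f"expert, 1/8 batch:   {expert_small_batch:.1f} FLOPs/byte")
print(f"expert, equal batch: {expert_equal_batch:.1f} FLOPs/byte")
```

The sparsity only lowers intensity to the extent that it shrinks the effective per-expert batch; the matmul itself is the same dense weight-times-activations product either way.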