Comment by gok
2 months ago
MoE is in general kind of a stupid optimization. It seems to require around 5x more total parameters for the same modeling power as a dense model, in exchange for needing around 2x less memory bandwidth.
The primary win of MoE models seems to be that you can list an enormous parameter count in your marketing materials.
Stupid? By paying 5x (more typically 2-4x, but whatever) in the thing you don't care about at inference (total parameter count), you gain 2x in the thing you do care about at inference (memory bandwidth). It's like handing out 4 extra bricks and getting back an extra lump of gold.
The general rule of thumb when assessing MoE <-> dense model intelligence is SQRT(Total_Params * Active_Params). For DeepSeek, that works out to ~158B params. The economics of batch inferencing a ~158B dense model at scale are quite different from DeepSeek's (the dense model needs ~4x more FLOPS per inference, after all), particularly if users care about latency.
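For concreteness, here's that arithmetic as a small Python sketch. The 671B total / 37B active figures for DeepSeek-V3 are my assumption (the commonly cited numbers), not something stated above:

    from math import sqrt

    # Assumed DeepSeek-V3 figures: 671B total parameters, 37B active per token.
    total_params = 671e9
    active_params = 37e9

    # Rule of thumb: dense-equivalent capability ~ sqrt(total * active)
    dense_equivalent = sqrt(total_params * active_params)
    print(f"dense-equivalent size: ~{dense_equivalent / 1e9:.0f}B")  # ~158B

    # Per-token FLOPS scale roughly with the parameters actually used,
    # so the equivalent dense model costs roughly this much more compute per token:
    print(f"FLOPS ratio vs. the MoE: ~{dense_equivalent / active_params:.1f}x")  # ~4.3x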