Comment by bick_nyers

2 months ago

The general rule of thumb when assessing MoE <-> Dense model intelligence is SQRT(Total_Params*Active_Params). For Deepseek, you end up with ~158B params. The economics of batch inferencing a ~158B model at scale are different when compared to something like Deepseek (it is ~4x more FLOPS per inference after all), particularly if users care about latency.