
Comment by ritz_labringue

6 days ago

You’re right, I conflated two things. MoE improves compute efficiency per token (only a few experts run), but it doesn’t meaningfully reduce memory footprint.

For fast inference you typically keep all experts resident in memory (or shard them across devices), so VRAM still scales with the total number of experts, not just the active ones.
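
To put rough numbers on it (everything below is an illustrative placeholder, not any real model's config), here's a quick Python back-of-envelope: weight memory scales with the total expert count, while per-token FLOPs scale only with the experts that actually fire.

    # Back-of-envelope: VRAM scales with *total* experts,
    # per-token compute scales with the *active* experts only.
    # All numbers are illustrative, not a real model config.
    num_experts = 8          # experts per MoE layer
    active_experts = 2       # top-k experts routed per token
    params_per_expert = 7e9  # illustrative expert size
    shared_params = 2e9      # attention, embeddings, etc.
    bytes_per_param = 2      # fp16/bf16 weights

    total_params = shared_params + num_experts * params_per_expert
    active_params = shared_params + active_experts * params_per_expert

    vram_gb = total_params * bytes_per_param / 1e9
    flops_per_token = 2 * active_params  # ~2 FLOPs per active parameter

    print(f"weights in VRAM : {vram_gb:.0f} GB  (all {num_experts} experts)")
    print(f"FLOPs per token : {flops_per_token:.2e}  (only {active_experts} experts)")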

Practically, that's why home setups are wasteful: you buy a GPU for its VRAM capacity, but MoE only activates a fraction of that compute on each token, so some experts/devices sit idle (because you're the only one using the model).

MoE doesn't make batching more efficient; it demands larger batches to maximize compute utilization and to amortize routing overhead. Dense models batch trivially (same weights for every token). MoE batches well only once the batch is large enough that each expert has work. So the point isn't that MoE makes batching better, it's that MoE needs bigger batches to reach its best utilization.
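
Here's a toy sketch of that last point (assuming uniform random routing, which real learned routers only roughly approximate): each expert sees on the order of batch * top_k / num_experts tokens, so a batch of one leaves most experts with nothing to do.

    # Toy illustration: the router splits a batch across experts, so each
    # expert only sees a fraction of it. Assumes uniform random routing.
    import random

    num_experts = 8
    top_k = 2

    def tokens_per_expert(batch_tokens):
        """Route each token to top_k random experts; return load per expert."""
        load = [0] * num_experts
        for _ in range(batch_tokens):
            for e in random.sample(range(num_experts), top_k):
                load[e] += 1
        return load

    for batch in (1, 8, 256):
        load = tokens_per_expert(batch)
        print(f"batch={batch:4d}  load per expert={load}")
    # batch=1: only top_k experts do any work at all, the rest sit idle.
    # batch=256: each expert gets ~batch*top_k/num_experts tokens, enough
    # to form matmuls big enough to keep the hardware busy.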