Comment by DiabloD3
17 hours ago
That's not a meaningful question. Models can be quantized to fit into much smaller memory requirements, and not all MoE layers (in MoE models) have to be offloaded to VRAM to maintain performance.
I mean 4-bit quantized. I can roughly calculate VRAM for dense models from the model size, but I don't know how to do it for MoE models?
Same calculation, basically. Any given ~30B model is going to be the same size and use the same VRAM (assuming you load it all into VRAM, which MoEs don't need to do).
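A rough back-of-the-envelope sketch of that calculation, assuming all you know is total parameter count, bits per weight, and a fudge factor for KV cache and runtime overhead (the numbers and the 1.2 overhead factor below are illustrative, not exact):

    # Rough VRAM estimate for weights; KV cache, activations and runtime
    # overhead are folded into a crude multiplier on top.
    def estimate_vram_gb(total_params_b, bits_per_weight=4.0, overhead=1.2):
        weight_gb = total_params_b * bits_per_weight / 8  # billions of params -> GB
        return weight_gb * overhead

    # A ~30B model at 4-bit: same arithmetic whether it's dense or MoE,
    # as long as you load the whole thing into VRAM.
    print(estimate_vram_gb(30))  # ~18 GB

The point is that the estimate depends on total parameters and quantization, not on whether the model is dense or MoE, so long as everything lives on the GPU.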
MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.
This isn't quite right: it'll run with the full model loaded into RAM, swapping in the experts as it needs them. It has turned out in the past that experts can be stable across more than one token, so you're not swapping as much as you'd think. I don't know if that's been confirmed to still be true on recent MoEs, but I wouldn't be surprised.
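A minimal sketch of that split, using made-up numbers for a hypothetical ~30B-total / ~3B-active MoE: the shared/attention weights plus the currently active experts sit in VRAM, the cold experts stay in system RAM and get swapped in on demand. This ignores KV cache and the bandwidth cost of swapping; it's only meant to show why the VRAM floor can be well below the full model size:

    def moe_memory_split_gb(total_params_b, active_params_b, bits_per_weight=4.0):
        gb_per_billion = bits_per_weight / 8
        vram = active_params_b * gb_per_billion                      # hot set kept on GPU
        ram = (total_params_b - active_params_b) * gb_per_billion    # cold experts in system RAM
        return vram, ram

    # Hypothetical 30B-total / 3B-active MoE at 4-bit:
    vram_gb, ram_gb = moe_memory_split_gb(30, 3)
    print(f"~{vram_gb:.1f} GB VRAM hot set, ~{ram_gb:.1f} GB in system RAM")  # ~1.5 GB / ~13.5 GB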