Comment by yekanchi
21 hours ago
I mean 4-bit quantized. I can roughly calculate VRAM for dense models from model size, but I don't know how to do it for MoE models?
Same calculation, basically. Any given ~30B model is going to be the same size and use the same VRAM (assuming you load it all into VRAM, which MoEs do not need to do) regardless of whether it's dense or MoE.
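Back-of-the-envelope it's something like this (a rough sketch; the overhead term for KV cache, activations, and buffers is an illustrative guess, not a measurement):

```cpp
// Rough VRAM estimate for a quantized model. The same arithmetic applies to
// dense and MoE models when all weights live in VRAM: total parameter count
// is what matters, not the active-per-token count.
#include <cstdio>

double estimate_vram_gib(double total_params_billion, double bits_per_weight,
                         double overhead_gib /* KV cache, activations, buffers */) {
    double weight_bytes = total_params_billion * 1e9 * bits_per_weight / 8.0;
    return weight_bytes / (1024.0 * 1024.0 * 1024.0) + overhead_gib;
}

int main() {
    // ~30B model at 4-bit: ~14 GiB of weights plus a couple GiB of overhead,
    // dense or MoE alike.
    printf("30B @ 4-bit: ~%.1f GiB\n", estimate_vram_gib(30.0, 4.0, 2.0));
    return 0;
}
```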
MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.
This isn't quite right: it can run with the full model loaded into RAM, swapping experts into VRAM as it needs them. It has turned out in the past that expert selection can be stable across more than one token, so you're not swapping as much as you'd think. I don't know whether that's been confirmed to still hold for recent MoEs, but I wouldn't be surprised.
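Roughly, the idea looks like this (a hypothetical sketch, not any engine's actual code; the cache policy and all names are made up for illustration):

```cpp
// Full model in host RAM, a small cache of experts in VRAM. If the router
// picks the same experts across consecutive tokens, most lookups are cache
// hits and the host->device copy is skipped entirely.
#include <cuda_runtime.h>
#include <unordered_map>
#include <list>
#include <cstddef>

struct ExpertCache {
    size_t expert_bytes;
    size_t capacity;                              // max experts resident in VRAM
    std::unordered_map<int, void*> resident;      // expert id -> device buffer
    std::list<int> lru;                           // least recently used order

    void* get(int expert_id, const void* host_weights) {
        auto it = resident.find(expert_id);
        if (it != resident.end()) {               // hit: expert already in VRAM
            lru.remove(expert_id);
            lru.push_back(expert_id);
            return it->second;
        }
        void* dev = nullptr;
        if (resident.size() >= capacity) {        // evict the coldest expert
            int victim = lru.front();
            lru.pop_front();
            dev = resident[victim];
            resident.erase(victim);
        } else {
            cudaMalloc(&dev, expert_bytes);
        }
        // miss: copy this expert's weights up from host RAM
        cudaMemcpy(dev, host_weights, expert_bytes, cudaMemcpyHostToDevice);
        resident[expert_id] = dev;
        lru.push_back(expert_id);
        return dev;
    }
};
```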
Also, though nobody has put the work in yet, the GH200 and GB200 (the NVIDIA "superchips") support exposing their full LPDDR5X and HBM3 as UVM (unified virtual memory), with much more memory bandwidth between LPDDR5X and HBM3 than a typical "instance" gets over PCIe. UVM can handle "movement" in the background and would be absolutely killer for these MoE architectures, but none of the popular inference engines actually allocate memory correctly for these architectures (via cudaMallocManaged()), allow UVM (CUDA) to handle data movement for them (automatic page migration and dynamic data movement), or are architected to avoid the pitfalls of this environment (e.g. the implications of CUDA graphs when using UVM).
It's really not that much code, though, and all the actual capabilities have been there since around the middle of this year. I think someone will make this work, and it will be a huge efficiency win for the right model/workload combinations (effectively, being able to run 1T-parameter MoE models on GB200 NVL4 at "full speed" if your workload has the right characteristics).
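For what it's worth, the CUDA calls involved are real and minimal; a sketch of what UVM-backed expert weights might look like (sizes and the prefetch policy are illustrative assumptions, not anyone's shipping code):

```cpp
// Allocate MoE expert weights through UVM so the CUDA driver migrates pages
// between LPDDR5X and HBM3 on demand (GH200/GB200 style).
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaSetDevice(device);

    size_t expert_bytes = 512ull << 20;   // one expert's weights (example size)
    void* expert = nullptr;

    // One allocation visible to both CPU and GPU; pages migrate on touch.
    cudaMallocManaged(&expert, expert_bytes);

    // Weights are read-only at inference time: let the driver replicate
    // read-mostly pages to readers instead of bouncing them back and forth.
    cudaMemAdvise(expert, expert_bytes, cudaMemAdviseSetReadMostly, device);

    // When the router selects this expert, prefetch it to HBM3 ahead of the
    // kernel launch instead of paying a page fault per access.
    cudaMemPrefetchAsync(expert, expert_bytes, device, /*stream=*/0);
    cudaDeviceSynchronize();

    // ... launch the expert's FFN kernels against `expert` ...

    cudaFree(expert);
    return 0;
}
```

The hard part isn't the allocation; it's rearchitecting an engine so its kernel scheduling (especially anything using CUDA graphs) tolerates pages moving underneath it.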
What you are describing would be uselessly slow and nobody does that.