Comment by theanonymousone

16 hours ago

But the RAM+VRAM can never be less than the size of the total (not active) model, right?

Correct. You want everything loaded in memory, but on each forward pass only a subset of experts is activated, so the computation is lower than in a dense model of the same total size.
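
For intuition, here's a minimal sketch of top-k expert routing (hypothetical sizes, not any particular model): all expert weights sit in memory, but each token only pays the compute for k of them.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy MoE layer: 8 experts in memory, only top_k=2 run per token."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(16, 64))  # 16 tokens; each touches only 2 of the 8 experts
```

Note that `self.experts` holds all 8 weight matrices the whole time; the memory footprint is the full model, even though per-token FLOPs are only 2/8 of it.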

That said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8 GB of RAM, but it would be really, really slow.
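
Roughly what that layer streaming looks like, as a sketch (the per-layer checkpoint files and `make_layer` factory here are hypothetical; real libraries do this with mmap and smarter prefetching):

```python
import torch

def streamed_forward(x, layer_paths, make_layer):
    # make_layer() builds one empty transformer layer; weights stream from disk,
    # so peak RAM stays around one layer plus activations instead of the full model.
    for path in layer_paths:               # e.g. ["layer_0.pt", ..., "layer_79.pt"]
        layer = make_layer()
        layer.load_state_dict(torch.load(path))  # pull one layer's weights off the SSD
        with torch.no_grad():
            x = layer(x)
        del layer                          # drop it before loading the next layer
    return x
```

The catch is that every forward pass re-reads the whole model from disk, so generation speed is bounded by SSD bandwidth rather than compute, which is why it's so slow in practice.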