Correct. You want all the experts loaded in memory, but each forward pass activates only a few of them, so the compute per token is lower than in a dense model of the same total size.
That said, there are libraries that can load a model layer by layer (say, from an SSD) and technically run inference with ~8 GB of RAM, but it would be really, really slow.
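
To make the "everything loaded, only some experts activated" point concrete, here is a rough toy sketch of top-k expert routing. It is not taken from any particular model: `moe_forward`, the gate vectors, the toy experts, and `top_k=2` are all made up for illustration.

```python
import math
import random

def moe_forward(x, experts, gate_weights, top_k=2):
    """Route one input through only the top_k highest-scoring experts."""
    # Gate score per expert: dot product of the input with that expert's gate vector.
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    # Keep the indices of the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the chosen scores only, so the mixing weights sum to 1.
    m = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - m) for i in chosen]
    total = sum(exps)
    out = [0.0] * len(x)
    for weight, i in zip(exps, chosen):
        y = experts[i](x)  # the only place an expert actually runs
        out = [o + (weight / total) * yi for o, yi in zip(out, y)]
    return out, chosen

# Hypothetical toy setup: 8 experts held in memory, each one just scales its input.
random.seed(0)
dim, n_experts = 4, 8
calls = [0] * n_experts  # counts how often each expert is executed

def make_expert(idx):
    scale = idx + 1
    def expert(x):
        calls[idx] += 1
        return [scale * xi for xi in x]
    return expert

experts = [make_expert(i) for i in range(n_experts)]
gate_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]

output, chosen = moe_forward([1.0, 0.5, -0.5, 2.0], experts, gate_weights, top_k=2)
```

All 8 experts exist in the `experts` list (that is the memory cost), but `calls` shows only 2 of them ran for this input (that is the compute saving).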
Can you give me a name please? Is that distributed llama or something else?
I have not used it but this is probably it: https://github.com/lyogavin/airllm
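
For what it's worth, the general layer-by-layer idea can be sketched like this. This is only a toy illustration of the concept and is not AirLLM's actual API or implementation: the JSON layer files, `save_layers`, `apply_layer`, and `streamed_forward` are all hypothetical names.

```python
import json
import os
import tempfile

def save_layers(layers, directory):
    """Write each layer to its own file, standing in for weights stored on disk."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths

def apply_layer(layer, x):
    """Toy dense layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(layer["W"], layer["b"])]

def streamed_forward(paths, x):
    """Run the forward pass while holding only one layer in memory at a time."""
    for path in paths:
        with open(path) as f:
            layer = json.load(f)  # disk read on every layer: this is the slow part
        x = apply_layer(layer, x)
        del layer  # drop the weights before loading the next layer
    return x

# Hypothetical two-layer "model" with 2x2 weights.
layers = [
    {"W": [[1.0, 0.0], [0.0, 1.0]], "b": [1.0, 1.0]},
    {"W": [[2.0, 0.0], [0.0, 2.0]], "b": [0.0, 0.0]},
]

with tempfile.TemporaryDirectory() as tmp:
    layer_paths = save_layers(layers, tmp)
    result = streamed_forward(layer_paths, [1.0, 2.0])
```

Peak memory is roughly one layer instead of the whole model, but every layer has to be read from disk again for each forward pass, which is why this approach ends up so slow.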