Comment by zozbot234

11 hours ago

It will certainly benefit from having the full amount of memory, but AIUI if you use system memory and mmap for your experts, you can execute the model with only enough memory for the active parameters; it's just unbearably slow, since it has to swap in new experts for every token. So the more memory you have in excess of that, the more inactive but often-used experts can be kept in RAM for better performance.

The ability to stream weights from disk has nothing to do with MoE or not. You can always do this. It will be unusable either way.

  • Agreed, but for a dense model you'd have to stream the whole model for every token, whereas with MoE there's at least the possibility that some experts may be "cold" for any given request and never be streamed in or cached. This will probably become more likely as models get even sparser. (The "it's unusable" judgment is correct if you're considering close-to-minimum requirements, but for just getting a model to fit, caching "almost all of it" in RAM may be an excellent choice.)

    • Unlike offloading weights from VRAM to system RAM, I just can't see a situation where you would want to offload to an SSD. The difference is just too large, and any model too large to run in system RAM is probably unusable anywhere except in VRAM.