Comment by zozbot234

8 days ago

You don't even need system RAM for the inactive experts; they can simply reside on disk and be accessed via mmap. The main remaining constraints these days will be any dense layers, plus the context size due to the KV cache. KV-cache writes are very sparse, so it can be offloaded to swap.
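A minimal sketch of the mmap idea, with toy numbers (the file layout, expert size, and count here are all made up for illustration): the whole "model" file is mapped, but only the pages belonging to the expert you actually slice out get faulted into physical memory; the OS page cache decides what stays resident.

```python
import mmap
import os
import tempfile

# Hypothetical layout: all experts stored contiguously in one file.
EXPERT_SIZE = 4096   # bytes per (toy) expert -- assumption, not a real model
NUM_EXPERTS = 8

# Build a fake weights file: each expert's bytes are just its index.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_EXPERTS):
        f.write(bytes([i]) * EXPERT_SIZE)
    path = f.name

def load_expert(mm, idx):
    """Slice out one expert; touching the slice faults in only those pages."""
    off = idx * EXPERT_SIZE
    return mm[off:off + EXPERT_SIZE]

with open(path, "rb") as f:
    # Map the whole file read-only; no bytes are read from disk yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    active = load_expert(mm, 3)   # say the router picked expert 3
    assert active == bytes([3]) * EXPERT_SIZE
    mm.close()

os.remove(path)
```

Inference runtimes like llama.cpp use essentially this mechanism (mmap is its default loading mode), so inactive experts cost nothing beyond what the page cache chooses to keep warm.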

Are there any benchmarks (or even vibes!) about the tokens/second one can expect with this strategy?

  • No real fixed benchmarks AIUI, since performance will then depend on how much spare RAM you have (which in turn depends on what queries you're making, how much context you're using, etc.) and on how high-performance your storage is. Given enough RAM, you aren't really losing any performance, because the OS is caching everything for you.

    (But then even placing inactive experts in system RAM is controversial: you're leaving perf on the table compared to having them all in VRAM!)

  • In my brief testing on a different MoE model, it did not perform well. I tried running Kimi-K2-Thinking-GGUF with the smallest unsloth quantization (UD-TQ1_0, 247 GB), and it ran at 0.1 tps. According to its guide, you should expect ~5 tps if the whole model fits into RAM+VRAM, but less than 1 tps if mmap has to be used, which matches my result. This was on a Ryzen AI Max+ 395 using ~100 GB VRAM.

    • Running a 247GB model reliably on 100GB of total VRAM is a very impressive outcome regardless of the performance. For a model that size, sensible people would recommend at least 4x the VRAM you were testing with; below that, total bandwidth to your storage becomes the bottleneck. Try running models that are only slightly bigger than your available VRAM and these tricks become quite essential, with a significantly more manageable performance hit.
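The storage-bandwidth-bound regime above can be sanity-checked with a back-of-envelope calculation. All the numbers here are assumptions, not measurements: a sustained NVMe read speed of ~2 GB/s, ~32B active parameters per token (Kimi-K2's reported active count), and roughly 1.7 bits/weight for a TQ1-class quantization. If nearly every active weight misses the page cache, each decoded token requires streaming all active weights from disk:

```python
# Back-of-envelope decode speed when mmap'd weights miss cache.
# All inputs are assumptions for illustration, not measurements.
storage_bw_gbs = 2.0        # sustained NVMe read bandwidth, GB/s (assumed)
active_params = 32e9        # active parameters per token (~32B for Kimi-K2)
bits_per_weight = 1.7       # roughly TQ1-class quantization (assumed)

# Bytes that must be streamed from storage per decoded token.
bytes_per_token = active_params * bits_per_weight / 8

# Tokens/second if storage bandwidth is the only bottleneck.
tps = storage_bw_gbs * 1e9 / bytes_per_token
print(f"~{tps:.1f} tok/s if every active weight misses cache")
```

This lands well under 1 tok/s, consistent with both the unsloth guide's warning and the 0.1 tps observation above; with ~100 GB of the hot experts cached in VRAM, the effective number sits somewhere between this floor and the all-in-memory ~5 tps.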