← Back to context

Comment by zozbot234

6 hours ago

I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference.

At least for the CPU/GPU split, llama.cpp recently added a `--fit` parameter (might default to on now?) that pairs with a `--fitc CONTEXTSIZE` parameter. That new feature will automatically look at your available VRAM and try to figure out a good CPU/GPU split for large models that leaves enough room for the context size that you request.

I use llama.cpp, and there is a way to do this - some layers to (i)GPU, the rest to CPU. I was just trying this out with Kimi K2.5 (in preparation for trying it out with Kimi K2.6 the other night. Check out the --n-cpu-moe flag in llama.cpp.

That said, my Strix Halo rig only has PCIe 4.0 for my NVMe, and I'm using a 990 Evo that had poor sustained random read, being DRAM-less. My effective read speeds from disk were averaging around 1.6-2.0 GB/s, and with unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, with ~64 GB worth of layers in iGPU and the rest free for KV cache + checkpoints. Even still, that was over 250 GB of weights streaming at ~2 GB/s, so I was getting 0.35 PP tok/s and 0.22 TG tok/s.

I could go a little faster with a better drive, or a little faster still if I dropping in two of em in raid0, but it would still be on the order of magnitude of sub-1 tok/s PP (compute limited) and TG (bandwidth limited).

  • In a computer with 2 PCIe 5.0 SSDs or one with a PCIe 5.0 SSDs and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s, or even more.

    This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable.

    Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots.

    I have actually started working at optimizing such an inference system, so your data is helpful for comparison.

    • Strix Halo, to my knowledge, does not support PCIe 5.0 NVMe drives, unfortunately, despite it being Zen 5, and Zen 5 supporting the PCIe 5.0 standard.

      While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.

      The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S, and what little you'd gain in this narrow use case with PCIe gen 5, you'd lose a lot in the more common use cases of models that can fit into a 128 GB pool of unified memory, and there are some really nice ones.

      Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 28-30 GB/s (with two in RAID0, at 14-15 GB/s each), but that's peak sequential reads, which is neither reflective of sustained reads, nor the random read workloads that dominate reading model weights.

      Maybe there are some Intel NUCs that compete in this space that I'm less up to speed with which do support PCIe 5. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, they Panther Lake that's actually shipping ships with so many fewer Xe3 cores that it's still a weaker system overall.

      Maybe DGX Spark supports PCIe 5.0, I don't own one and am admittedly not as familiar with that platform either, though it's worth mentioning that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still)

      2 replies →

  • Now I want to put two p5800x's to use. I wonder how much tinkering would be necessary to mmap a raid setup with them directly to the gpu. Im not fully busy with LLM's and more with graphics and systems, but this seems like a fun project to try out.