Comment by anonym29
8 hours ago
I use llama.cpp, and there is a way to do this - some layers to (i)GPU, the rest to CPU. I was just trying this out with Kimi K2.5 (in preparation for trying it out with Kimi K2.6) the other night. Check out the --n-cpu-moe flag in llama.cpp.
That said, my Strix Halo rig only has PCIe 4.0 for my NVMe, and I'm using a 990 Evo that has poor sustained random read, being DRAM-less. My effective read speeds from disk were averaging around 1.6-2.0 GB/s. I was running unsloth's K2.5 in IQ2_XXS, at "just" 326 GB, with ~64 GB worth of layers in iGPU and the rest free for KV cache + checkpoints. Even so, that was over 250 GB of weights streaming at ~2 GB/s, so I was getting 0.35 PP tok/s and 0.22 TG tok/s.
I could go a little faster with a better drive, or a little faster still if I dropped in two of 'em in RAID0, but it would still be on the order of sub-1 tok/s for both PP (compute limited) and TG (bandwidth limited).
In a computer with 2 PCIe 5.0 SSDs, or one with a PCIe 5.0 SSD and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s, or even more.
This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable.
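As a rough sanity check on that estimate, here is a back-of-the-envelope sketch using only the figures quoted above (~2 GB/s and 0.22 TG tok/s). It assumes generation is purely limited by SSD bandwidth and that the amount of data read per token stays the same on the faster system, neither of which is guaranteed. The function names are made up for illustration.

```python
def gb_read_per_token(read_gb_per_s: float, tok_per_s: float) -> float:
    """GB streamed from disk per generated token, implied by a measurement."""
    return read_gb_per_s / tok_per_s


def projected_tok_per_s(read_gb_per_s: float, gb_per_token: float) -> float:
    """Token rate if the only limit is how fast weights can be streamed."""
    return read_gb_per_s / gb_per_token


# Figures from the comment above: ~2 GB/s effective reads, 0.22 TG tok/s.
measured = gb_read_per_token(2.0, 0.22)
# Assumed target from the reply: 20 GB/s aggregate SSD throughput.
fast = projected_tok_per_s(20.0, measured)

print(f"{measured:.1f} GB/token, {fast:.1f} tok/s at 20 GB/s")
# prints "9.1 GB/token, 2.2 tok/s at 20 GB/s"
```

Under those assumptions, a 10x increase in read throughput maps directly to a 10x increase in generation speed, which is where "a couple of tokens per second" comes from.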
Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots.
I have actually started working on optimizing such an inference system, so your data is helpful for comparison.
Strix Halo, to my knowledge, does not support PCIe 5.0 NVMe drives, unfortunately, despite it being Zen 5, and Zen 5 supporting the PCIe 5.0 standard.
While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.
The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S, and what little you'd gain in this narrow use case with PCIe gen 5, you'd lose a lot in the more common use cases of models that can fit into a 128 GB pool of unified memory, and there are some really nice ones.
Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 28-30 GB/s (with two in RAID0, at 14-15 GB/s each), but that's peak sequential reads, which is neither reflective of sustained reads, nor the random read workloads that dominate reading model weights.
Maybe there are some Intel NUCs that compete in this space that I'm less up to speed with which do support PCIe 5. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, the Panther Lake that's actually shipping ships with so many fewer Xe3 cores that it's still a weaker system overall.
Maybe DGX Spark supports PCIe 5.0. I don't own one and am admittedly not as familiar with that platform either, though it's worth mentioning that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo systems run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still).
While you are right about the advantages of Strix Halo, those advantages matter only as long as you can fit the entire model inside the 128 GB DRAM.
If you use a bigger model and your performance becomes limited by the SSD throughput, then a slower CPU and GPU will not affect the performance in an optimized implementation, where weights are streamed continuously from the SSDs and all computations are overlapped with that streaming.
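To make "streamed continuously with computation overlapped" concrete, here is a minimal structural sketch, not an actual inference engine. A reader thread keeps a small bounded queue of chunks filled while the main thread processes the previous chunk. Everything here is a stand-in: `run_pipeline`, `stream_chunks`, the in-memory "weights" and the `sum` used as the compute step are all hypothetical, and a real implementation would presumably use asynchronous or direct I/O and GPU kernels instead.

```python
import io
import queue
import threading


def stream_chunks(f, chunk_size, out_q):
    """Reader thread: keep fetching chunks so the next one is ready early."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        out_q.put(chunk)  # blocks when the queue is full
    out_q.put(None)  # sentinel: no more data


def run_pipeline(f, chunk_size, compute, depth=2):
    """Process chunks from f while the reader prefetches up to `depth` ahead."""
    q = queue.Queue(maxsize=depth)
    reader = threading.Thread(
        target=stream_chunks, args=(f, chunk_size, q), daemon=True
    )
    reader.start()
    results = []
    while (chunk := q.get()) is not None:
        results.append(compute(chunk))  # overlaps with the next read
    reader.join()
    return results


# Stand-in for a weights file: 1024 bytes, processed in 256-byte chunks.
fake_weights = io.BytesIO(bytes(range(256)) * 4)
sums = run_pipeline(fake_weights, 256, compute=sum)
print(sums)  # [32640, 32640, 32640, 32640]
```

As long as the compute step for one chunk finishes before the next chunk arrives, total time is set by the read side alone, which is the situation described above.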
I have an ASUS NUC with Arrow Lake H and 2 SSDs, one PCIe 5.0 and one PCIe 4.0. I also have a Zen 5 desktop, which like most such desktops also has 2 SSDs, one PCIe 5.0 and one PCIe 4.0. Many Ryzen motherboards, including mine, allow multiple PCIe 4.0 SSDs, but those do not increase the throughput, because they share the same link between the I/O bridge and the CPU.
So with most cheap computers you can have 1 PCIe 5.0 SSD + 1 PCIe 4.0 SSD. With PCIe 4.0, it is easy to find SSDs that reach the maximum throughput of the interface, i.e. between 7 and 7.5 GB/s. For PCIe 5.0, the throughput depends on how expensive the SSD is and on how much power it consumes, from only around 10 GB/s up to the interface limit, i.e. around 15 GB/s.
With SSDs having different speeds, RAID0 is not appropriate. Instead, the interleaving between weights stored on one SSD and on the other must be done in software, in proportion to throughput, e.g. one third stored on the slower SSD and two thirds on the faster SSD when the faster one is twice as fast.
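The split above can be sketched as follows, assuming the goal is simply that both drives finish reading their share at the same time. The 15 GB/s and 7.5 GB/s figures are the interface limits mentioned above; the 300 GB model size and the function names are hypothetical.

```python
def split_by_bandwidth(total_bytes: int, bandwidths_gb_s: list[float]) -> list[int]:
    """Assign each drive a share of the bytes proportional to its throughput."""
    total_bw = sum(bandwidths_gb_s)
    shares = [int(total_bytes * bw / total_bw) for bw in bandwidths_gb_s]
    shares[-1] += total_bytes - sum(shares)  # rounding remainder to last drive
    return shares


def read_time_s(shares: list[int], bandwidths_gb_s: list[float]) -> float:
    """Time for one full pass: the slowest drive to finish sets the pace."""
    return max(s / (bw * 1e9) for s, bw in zip(shares, bandwidths_gb_s))


model_bytes = 300 * 10**9   # hypothetical 300 GB of weights
drives = [15.0, 7.5]        # PCIe 5.0 and PCIe 4.0 SSD, in GB/s

proportional = split_by_bandwidth(model_bytes, drives)
even = [model_bytes // 2, model_bytes // 2]  # what RAID0 striping would do

print(proportional)                      # [200000000000, 100000000000]
print(read_time_s(proportional, drives)) # about 13.3 s
print(read_time_s(even, drives))         # 20.0 s, held back by the slow drive
```

With an even split the faster drive sits idle while the slower one finishes, so the pair behaves like two 7.5 GB/s drives. The proportional split gets the full 22.5 GB/s combined.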
A Zen 5 desktop with a discrete GPU is faster than Strix Halo when not limited by the main memory interface, but when the performance is limited by SSD throughput, I bet that even the Intel NUC can reach that limit, and a faster GPU/CPU combo would not make a difference.
Now I want to put two P5800Xs to use. I wonder how much tinkering would be necessary to mmap a RAID setup with them directly to the GPU. I'm not really busy with LLMs and more with graphics and systems, but this seems like a fun project to try out.