Comment by anonym29

8 hours ago

Strix Halo, to my knowledge, does not support PCIe 5.0 NVMe drives, unfortunately, despite it being Zen 5, and Zen 5 supporting the PCIe 5.0 standard.

While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.

The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S. Whatever little you'd gain from PCIe gen 5 in this narrow use case, you'd lose far more in the more common use cases of models that fit into a 128 GB pool of unified memory, and there are some really nice ones.

Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 28-30 GB/s (with two in RAID0, at 14-15 GB/s each), but that's peak sequential reads, which is neither reflective of sustained reads, nor the random read workloads that dominate reading model weights.

Maybe there are some Intel NUCs that compete in this space that I'm less up to speed with which do support PCIe 5. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, the Panther Lake that's actually shipping has so many fewer Xe3 cores that it's still a weaker system overall.

Maybe DGX Spark supports PCIe 5.0; I don't own one and am admittedly not as familiar with that platform either. It's worth mentioning, though, that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo systems run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still).

While you are right about the advantages of Strix Halo, those advantages matter only as long as you can fit the entire model inside the 128 GB DRAM.

If you use a bigger model and your performance becomes limited by the SSD throughput, then a slower CPU and GPU will not affect the performance in an optimized implementation, where weights are streamed continuously from the SSDs and all computations are overlapped with that streaming.

I have an ASUS NUC with Arrow Lake H and 2 SSDs, one PCIe 5.0 and one PCIe 4.0. I also have a Zen 5 desktop, which like most such desktops also has 2 SSDs, one PCIe 5.0 and one PCIe 4.0. Many Ryzen motherboards, including mine, allow multiple PCIe 4.0 SSDs, but those do not increase the throughput, because they share the same link between the I/O bridge and the CPU.

So with most cheap computers you can have 1 PCIe 5.0 SSD + 1 PCIe 4.0 SSD. With PCIe 4.0, it is easy to find SSDs that reach the maximum throughput of the interface, i.e. between 7 and 7.5 GB/s. For PCIe 5.0, the throughput depends on how expensive the SSD is and on how much power it consumes, from only around 10 GB/s up to the interface limit, i.e. around 15 GB/s.

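With SSDs of different speeds, RAID0 is not appropriate, because it stripes data evenly and the slower drive would hold back the faster one. Instead, the interleaving between weights stored on one SSD and on the other must be done in software, in proportion to throughput: with a roughly 2:1 speed ratio, one third must be stored on the slower SSD and two thirds on the faster SSD.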
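
As a rough illustration of that proportional split (a sketch only: `split_by_throughput` and `read_time_seconds` are hypothetical helper names, and the 15 GB/s and 7.5 GB/s figures are the approximate interface limits mentioned above, not measurements):

```python
def split_by_throughput(total_bytes: int, throughputs_gb_s: list[float]) -> list[int]:
    """Split total_bytes across drives in proportion to each drive's throughput."""
    total = sum(throughputs_gb_s)
    shares = [int(total_bytes * t / total) for t in throughputs_gb_s]
    # Give any bytes lost to rounding to the first drive so the shares sum exactly.
    shares[0] += total_bytes - sum(shares)
    return shares


def read_time_seconds(shares: list[int], throughputs_gb_s: list[float]) -> float:
    """Time for one full pass when all drives are read in parallel (slowest drive wins)."""
    return max(s / (t * 1e9) for s, t in zip(shares, throughputs_gb_s))


if __name__ == "__main__":
    drives = [15.0, 7.5]        # assumed: PCIe 5.0 and PCIe 4.0 SSD, in GB/s
    model = 225_000_000_000     # assumed: 225 GB of weights, purely illustrative
    shares = split_by_throughput(model, drives)
    print(shares, read_time_seconds(shares, drives))
```

With this split both drives finish at about the same time, so the pair behaves like a single device with the sum of the two throughputs. An even RAID0-style split would instead leave the faster drive idle while the slower one finishes.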

A Zen 5 desktop with a discrete GPU is faster than Strix Halo when not limited by the main memory interface, but when the performance is limited by the SSD throughput, I bet that even the Intel NUC can reach that limit and a faster GPU/CPU combo would not make a difference.
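
To put a number on "limited by the SSD throughput", here is a back-of-the-envelope bound. It is a sketch under loud assumptions: every active weight is read from SSD for every token (nothing cached in DRAM), a single sequence is generated, and compute is fully overlapped. The function name and the example inputs are hypothetical.

```python
def ssd_bound_tokens_per_second(ssd_gb_per_s: float,
                                active_params: float,
                                bits_per_weight: float) -> float:
    """Upper bound on tokens/s when each token must stream all active weights from SSD."""
    bytes_per_token = active_params * bits_per_weight / 8
    return ssd_gb_per_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Assumed inputs: 15 + 7.5 GB/s combined, 17B active parameters, 4-bit weights.
    print(ssd_bound_tokens_per_second(22.5, 17e9, 4))
```

With those assumed inputs the bound is about 2.6 tokens/s, and it depends only on the drives and the model, not on the CPU or GPU. Keeping part of the weights in DRAM would raise it.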

  • That sounds like a huge hassle for what I imagine must be peak speeds of low double digit tok/s PP and TG, even with effective prompt caching and self-ngram and all the other tricks, no?

    If I really felt like I needed larger models locally (I don't, the 120/122B A10/12B models are awesome on my hardware), I think I'd rather either pony up for a used M3 Ultra 512GB, wait for an M5 Ultra (hoping they bring back the 512GB config on the new setup), or build some old dual-socket Xeon or Epyc 8/12-channel DDR4 setup where I can still get memory bandwidth in the hundreds of GB/s.

    What kinds of models are you running over 128GB, and what kind of speeds are you seeing, if you don't mind me asking?

    • Until now I have not run models that do not fit in 128 GB.

      I have an Epyc server with 128 GB of high-throughput DRAM, which also has 2 AMD GPUs with 16 GB of DRAM each.

      Until now I have experimented only with models that can fit in this memory, e.g. various medium-size Qwen and Gemma models, or gpt-oss.

      But I am curious about how bigger models behave, e.g. GLM-5.1, Qwen3.5-397B-A17B, Kimi-K2.6, DeepSeek-V3.2, MiniMax-M2.7. I am also curious about how the non-quantized versions of the models with around 120B parameters behave, e.g. such versions of Nemotron and Qwen. It is said that quantization to 8 bits or even to 4 bits has negligible effects, but I want to confirm this with my own tests.

      There is no way to test big models or non-quantized medium models at a reasonable cost other than with weights read from SSDs. For some tasks, it may be preferable to use a big model at a slow speed, if that means that you need fewer attempts to obtain something useful. For a coding assistant, it may be possible to batch many tasks, which will progress simultaneously during a single pass over the SSD data.

      For now I am studying llama.cpp in order to determine how it can be modified to achieve the maximum performance that could be reached with SSDs.