I'm really glad I bought Strix Halo. It's a beast of a system, and it runs models that an RTX 6000 Pro costing almost 5x as much can't touch. It's a great addition to my existing Nvidia GPU (a 4080), which can't even run Qwen3-Next-80B without heavy quantization, let alone 100B+, 200B+, or 300B+ models, and unlike GB10, I'm not stuck with ARM cores and the ARM software ecosystem.
To your point, though: if the successors to Strix Halo, Serpent Lake (x86 Intel CPU + Nvidia iGPU) and Medusa Halo (x86 AMD CPU + AMD iGPU), come in at a similar price point, I'll probably go with Serpent Lake, given the specs are otherwise similar (both are looking at a 384-bit unified memory bus to LPDDR6 with 256GB unified memory options). CUDA is better than ROCm, no argument there.
That said, this has nothing to do with the (now resolved) issue I was experiencing with LM Studio not respecting existing Developer Mode settings after this latest update. There are good reasons to want to switch between backends (e.g. debugging whether early model-release issues, like those we saw with GLM-4.7-Flash, are specific to Vulkan; some of them were, in that specific case). Bugs like that do exist, but I've had even fewer stability issues on Vulkan than I've had with CUDA on my 4080.
> I'm sure the clang compile times are very respectable, but for LLMs? A paltry 200 GB/s compared to the RTX 6000 Pro's 1.8 TB/s.
>
> Sure, you can load big(-ish) models on it, but if you're getting <10 tokens per second, that severely limits how useful it is.
With KV caching, most of the MoE models are very usable in Claude Code. Active parameter count seems to dominate TG speed, and unlike PP, TG speed doesn't decay much even as context grows.
Even moderately large and capable models like gpt-oss:120b and Qwen3-Next-80B have pretty good TG speeds - think 50+ tok/s on gpt-oss:120b.
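If you want to sanity-check that intuition, here's a rough, bandwidth-bound back-of-envelope sketch. The ~5B active params, ~4.5 effective bits/weight, and ~220 GB/s figures are my assumptions for illustration, not measured values, but the result lines up with the 50+ tok/s I actually see:

```python
# Rough back-of-envelope: TG (decode) speed is roughly bounded by how many
# bytes of active weights have to be streamed from memory per generated token.
# All three inputs below are assumptions for illustration.

active_params = 5e9      # assumed active parameters per token for an MoE like gpt-oss:120b
bits_per_weight = 4.5    # assumed effective size after ~4-bit quantization plus overhead
bandwidth_gbs = 220      # assumed sustained memory bandwidth on Strix Halo, GB/s

bytes_per_token = active_params * bits_per_weight / 8
tg_upper_bound = bandwidth_gbs * 1e9 / bytes_per_token

print(f"~{bytes_per_token / 1e9:.1f} GB of weights streamed per token")
print(f"=> roughly {tg_upper_bound:.0f} tok/s upper bound, before overheads")
```

Real throughput comes in lower once you add KV reads, activations, and kernel overhead, but the scaling is the point: only the active experts get streamed per token, not all 120B weights.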
PP is the main thing that suffers due to memory bandwidth, particularly over very long prompt stretches on typical transformer models, given attention's quadratic cost, but like I said, with KV caching it's not a big deal.
Additionally, newer architectures like hybrid linear attention (Qwen3-Next) and hybrid Mamba (Nemotron) show much less PP degradation at longer contexts; not that I'm doing much long-context processing anyway, thanks to KV caching.
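To make the PP vs. TG scaling concrete, here's a toy sketch that counts only attention FLOPs for a hypothetical dense transformer (the layer count and width are made-up round numbers, not any particular model):

```python
# Toy scaling sketch: prefill attention cost grows quadratically with prompt
# length, while one cached decode step grows only linearly with context length.
# Dimensions are hypothetical; only the scaling matters here.

layers, d_model = 48, 4096  # made-up transformer dimensions

def attention_flops_prefill(n_tokens):
    # QK^T plus attention-weighted V: roughly 4 * n^2 * d per layer
    return 4 * n_tokens**2 * d_model * layers

def attention_flops_cached_decode(ctx_len):
    # One new token attending over a KV cache of ctx_len entries: roughly 4 * n * d per layer
    return 4 * ctx_len * d_model * layers

for ctx in (4_096, 32_768, 131_072):
    print(f"ctx={ctx:>7}: prefill attention ~{attention_flops_prefill(ctx):.1e} FLOPs, "
          f"next-token attention ~{attention_flops_cached_decode(ctx):.1e} FLOPs")
```

Prefill grows with the square of the prompt length, while each cached decode step grows only linearly with the context it attends over, which is why TG holds up so much better than PP as context gets long.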
My 4080 is absolutely several times faster... on the teeny-tiny models that fit on it. Could I have done something like a 5090 or a dual-3090 setup? Sure. Just keep in mind I spent considerably less on my entire Strix Halo rig (a Beelink GTR 9 Pro, $1,980 with a coupon and pre-order pricing) than a single 5090 costs ($3k+ for just the card, easily $4k+ for a complete PCIe 5 system); it draws ~110W under Vulkan workloads, idles below 10W, and takes up about as much space as a GameCube. Comparing it to an $8,500 RTX 6000 Pro is nonsensical, and that card was outside my budget in the first place.
Where I will absolutely give your argument credit: for AI outside of LLMs (think genAI workloads like text2img, text2vid, img2img, img2vid, text2audio, etc.), Nvidia just works while Strix Halo just doesn't. For ComfyUI workloads, I'm still strictly using my 4080. Those just aren't very important to me, though.
Also, as a final note, Strix Halo's theoretical memory bandwidth is 256 GB/s, and I routinely see ~220 GB/s in the real world, not 200 GB/s. That's a small difference when you're comparing against GDDR7 on a 512-bit bus, but the point stands.
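For anyone who wants to ballpark their own box, a crude copy test like the sketch below works as a first-order sanity check. It's not a proper STREAM benchmark, and it measures CPU-side copy traffic rather than what the iGPU actually sees, so treat the number as a ballpark only:

```python
# Crude memory-bandwidth sanity check: time a large array copy and count
# both the read and the write traffic. Allocator, caching, and NUMA/iGPU
# carve-out details all affect the result, so this is only a rough probe.

import time
import numpy as np

size_gb = 2
a = np.ones(size_gb * 1024**3 // 8)   # 2 GiB of float64 source data
b = np.empty_like(a)                  # destination buffer

start = time.perf_counter()
np.copyto(b, a)
elapsed = time.perf_counter() - start

moved_gb = 2 * a.nbytes / 1e9         # bytes read + bytes written
print(f"~{moved_gb / elapsed:.0f} GB/s effective copy bandwidth")
```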