Comment by anonym29
1 day ago
With KV caching, most of the MoE models are very usable in Claude Code. Active param count seems to dominate TG speeds, and unlike PP, TG speed doesn't decay much as context length grows.
Even moderately large and capable models like gpt-oss:120b and Qwen3-Next-80B have pretty good TG speeds: think 50+ tok/s on gpt-oss:120b.
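To put rough numbers on the "active params dominate TG" point, here's a back-of-the-envelope sketch. The ~5.1B active-param figure for gpt-oss:120b is published; the ~0.55 bytes/weight for its MXFP4 quant and the bandwidth figure are my own estimates:

```python
# Back-of-the-envelope: for a memory-bandwidth-bound decoder, every generated
# token has to stream all active weights from RAM, so TG tok/s is roughly
# bounded by bandwidth / active-weight bytes.

def tg_upper_bound(bandwidth_gbs: float, active_params_b: float, bytes_per_weight: float) -> float:
    """Rough ceiling on decode speed (tokens/s), ignoring KV-cache reads and overhead."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gbs * 1e9 / bytes_per_token

# Assumed inputs: ~220 GB/s real-world Strix Halo bandwidth, ~5.1B active
# params for gpt-oss:120b, ~0.55 bytes/weight for its MXFP4 quantization.
print(f"gpt-oss:120b ceiling: {tg_upper_bound(220, 5.1, 0.55):.0f} tok/s")
# -> roughly 78 tok/s; the observed 50+ tok/s sits comfortably under that ceiling.
```

Note this ceiling depends only on active params and bandwidth, not total model size, which is why a 120B MoE can decode this fast on this hardware.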
PP is the main thing that suffers, particularly over very long prefill stretches on typical transformer models, since attention cost grows quadratically with context length; but like I said, with KV caching it's not a big deal.
Additionally, newer architectures like hybrid linear attention (Qwen3-Next) and hybrid Mamba (Nemotron) exhibit much less PP degradation over longer contexts, not that I'm doing much long-context processing anyway, thanks to KV caching.
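To illustrate the quadratic-vs-linear point from the last two paragraphs, here's a toy per-layer FLOP comparison. It's a deliberate simplification (real hybrid models like Qwen3-Next still interleave some full-attention layers, and I'm ignoring the MLP and projections), but it shows why prefill cost blows up with context under standard attention:

```python
# Toy model, not a benchmark: softmax attention's score/value matmuls cost
# O(n^2 * d) per layer, while linear-attention variants keep a running state
# and stay O(n * d^2).

def softmax_attn_flops(n: int, d: int) -> float:
    # QK^T and attn@V: two n x n x d matmuls, ~2 FLOPs per multiply-accumulate
    return 2 * 2 * n * n * d

def linear_attn_flops(n: int, d: int) -> float:
    # state update + readout: roughly two d x d operations per token
    return 2 * 2 * n * d * d

d = 128  # typical per-head dimension (my assumption)
for n in (4_096, 32_768, 131_072):
    ratio = softmax_attn_flops(n, d) / linear_attn_flops(n, d)
    print(f"ctx {n:>7}: softmax/linear attention FLOP ratio ~ {ratio:,.0f}x")
# The ratio works out to n/d: ~32x at 4k context, ~1024x at 128k context.
```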
My 4080 is absolutely several times faster... on the teeny tiny models that fit on it. Could I have done something like a 5090 or dual-3090 setup? Sure. Just keep in mind I spent considerably less on my entire Strix Halo rig (a Beelink GTR 9 Pro, $1,980 w/ coupon + pre-order pricing) than on a single 5090 ($3k+ for just the card, easily $4k+ for a complete PCIe 5 system); it draws ~110W under Vulkan workloads, idles below 10W, and takes up about as much space as a GameCube. Comparing it to an $8,500 RTX 6000 Pro is completely nonsensical, and that card was outside my budget in the first place.
Where I will absolutely give your argument credit: for AI outside of LLMs (think generative media: text2img, text2vid, img2img, img2vid, text2audio, etc.), Nvidia just works while Strix Halo just doesn't. For ComfyUI workloads, I'm still strictly using my 4080. Those workloads aren't very important to me, though.
Also, as a final note: Strix Halo's theoretical MBW is 256 GB/s, and I routinely see ~220 GB/s real-world, not 200 GB/s. It's a small difference when compared against GDDR7 on a 512-bit bus, but the point stands.
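For what it's worth, here's a crude way to sanity-check effective copy bandwidth from userspace. A single-threaded probe like this usually undershoots the platform peak, so treat it as a floor; it's just a quick check, not the source of my ~220 GB/s number:

```python
# Rough host-side memory bandwidth probe: time a large buffer copy and count
# bytes read + bytes written. Proper tools (e.g. multi-threaded STREAM-style
# benchmarks) will get closer to the real ceiling.
import time
import numpy as np

src = np.ones(2 * 1024**3 // 8, dtype=np.float64)  # ~2 GiB source buffer
dst = np.empty_like(src)                           # ~2 GiB destination buffer

t0 = time.perf_counter()
np.copyto(dst, src)                                # streams ~2 GiB read + ~2 GiB write
elapsed = time.perf_counter() - t0
print(f"~{(2 * src.nbytes / 1e9) / elapsed:.0f} GB/s effective copy bandwidth")
```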