
Comment by Tepix

5 days ago

You can get two Strix Halo PCs with similar specs for that $4000 price. I just hope that prompt processing speeds will continue to improve, because Strix Halo is still quite slow in that regard.

Then there is the networking. While Strix Halo systems come with two USB4 40 Gbit/s ports, it's difficult to:

a) connect more than 3 machines with two ports each

b) get more than 23 Gbit/s or so per connection, if you're lucky. Latency will also be in the 0.2 ms range, which leaves room for improvement.

Something like Apple's RDMA via Thunderbolt would be great to have on Strix Halo…

As you allude to, prompt processing speed is a killer advantage of the Spark, one that even two Strix Halo boxes would not match.

Prompt processing is literally 3x to 4x faster on GPT-OSS-120B once you're a little way into your context window, and the Spark is similarly much faster for image generation or any other AI task.

Plus the Nvidia ecosystem, as others have mentioned.

One discussion with benchmarks: https://www.reddit.com/r/LocalLLaMA/comments/1oonomc/comment...

If all you care about is token generation with a tiny context window, then they are very close, but that’s basically the only time. I studied this problem extensively before deciding what to buy, and I wish Strix Halo had been the better option.

  • Prompt processing could be sped up with NPU inference. The Strix Halo NPU is a bit weird (XDNA 2, so the architecture is spatial dataflow with programmable interconnects), but it's there. See https://github.com/FastFlowLM/FastFlowLM (which is directly supported by Lemonade: https://lemonade-server.ai/ , https://github.com/lemonade-sdk/lemonade ) for one existing project that's planning to support the NPU for the prompt processing phase. (Do note that FLM provides its proprietary NPU kernels under a non-free license, so make sure that fits your needs before use.)

    • I’ve seen this claim a lot, but I’m skeptical. Has anyone actually published benchmarks showing a big speedup from using the NPU for prefill?

      AMD's own marketing numbers say the NPU is about 50 TOPS out of 126 TOPS total compute for the platform. Even if you hand-wave everything else away, that caps the theoretical upside at roughly 1.6x (126 TOPS with the NPU vs. ~76 TOPS without it; rough arithmetic sketched below).

      But that assumes:

      1. Your workload maps cleanly onto the NPU’s 8-bit fast path.

      2. There’s no overhead coordinating the iGPU + NPU.

      My expectation is the real-world gain would be close to 0, but I'd love to be proven wrong!
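      A back-of-the-envelope check of that ceiling, using the 50/126 TOPS split from the comment above; the assumption of perfect overlap and a purely compute-bound prefill is mine:

      ```python
      # Rough ceiling on prefill speedup from adding the NPU, assuming perfect
      # overlap with the iGPU/CPU and a compute-bound prefill (both optimistic).
      total_tops = 126                              # platform total, per AMD marketing
      npu_tops = 50                                 # NPU share of that total
      igpu_and_cpu_tops = total_tops - npu_tops     # ~76 TOPS without the NPU

      ideal_speedup = total_tops / igpu_and_cpu_tops
      print(f"Ideal prefill speedup with NPU offload: {ideal_speedup:.2f}x")  # ~1.66x
      ```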

  • Then again, I have an RTX 5090 + 96GB DDR5-6000 that crushes the Spark on prompt processing of gpt-oss-120b (something like 2-3x faster), while token generation is pretty close (a back-of-the-envelope sketch of why is below). The cost I paid was ~$3200 for the entire computer. With the currently inflated RAM prices, it would probably be closer to the Dell.

    So while I think the Strix Halo is a mostly useless machine for any kind of AI, and I think the Spark is actually useful, I don't think pure inference is a good use case for either.

    It probably only makes sense as a dev kit for larger cloud hardware.
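    For intuition on why token generation lands close together while prefill diverges, here's a crude bandwidth-bound estimate. The parameter count and bandwidth figures are approximations I'm supplying, not numbers from this thread:

    ```python
    # Bandwidth-bound decode ceiling: tokens/s ~= memory bandwidth / bytes read per token.
    # gpt-oss-120b activates ~5.1B parameters per token; ~4-bit weights assumed.
    # Illustrative only -- KV-cache and attention traffic are ignored.
    active_params = 5.1e9
    bytes_per_token_gb = active_params * 0.5 / 1e9      # ~2.6 GB of weights per token

    for name, bw_gbs in [("DGX Spark", 273), ("Strix Halo", 256)]:
        print(f"{name}: <= {bw_gbs / bytes_per_token_gb:.0f} tok/s upper bound")

    # A 5090 box keeps pace because most of the hot weights sit in ~1.8 TB/s VRAM and
    # only the overflow streams from system RAM; prefill, by contrast, is compute-bound,
    # which is where the 5090's much larger FLOPS budget shows up as the 2-3x gap above.
    ```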

  • Could I get your thoughts on the Asus GX10 vs. spending that money on GPU compute? It seems like one could get a lot of total VRAM with better memory bandwidth and make PCIe the bottleneck, especially if you already have a motherboard with spare slots.

    I'm trying to better understand the trade-offs, or whether it just depends on the workload.

    • Run a model at all, run a model fast, run a model cheap. Pick 2.

      With LLM workloads, you can run some of the larger local models (at all) and you can run them cheap on the unified 128GB RAM machines (Strix Halo/Spark) - for example, gpt-oss-120b. At 4-bit quantization, given it's an MoE natively trained in MXFP4, it'll be pretty quick (rough capacity math below). Some of the other MoEs with small active parameter counts will be quick as well, but things will get sluggish as the active parameter count increases. The best way to run these models is with a multi-GPU rig so you get speed and VRAM density at once, but that's expensive.
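      As a rough capacity sanity check for the "at all" part (the parameter count and overhead figures are ballpark numbers of mine, not from this thread):

      ```python
      # Does 4-bit gpt-oss-120b fit in 128 GB of unified memory? Back-of-the-envelope.
      total_params = 117e9                    # approx. total parameter count
      weights_gb = total_params * 0.5 / 1e9   # ~4 bits/param -> ~59 GB of weights

      kv_cache_gb = 8       # ballpark for tens of thousands of context tokens
      overhead_gb = 6       # activations, buffers, OS, etc. (guess)

      print(f"~{weights_gb + kv_cache_gb + overhead_gb:.0f} GB needed of 128 GB")  # fits comfortably
      ```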

      With other workloads such as image/video generation, unified memory doesn't help as much and the operations themselves intrinsically run better on beefier GPU cores, in part because many of those models are relatively small compared to LLMs (6B-20B parameters) but generating from them is definitely GPU compute intensive. So you get infinitely more from a 3090 (maybe even a slightly lesser card) than you do from a unified memory rig.

      If you're running a mixture of LLM and image/video generation workloads, there is no easy answer. Some folks on a budget opt for a unified memory machine with an eGPU to get the best of both worlds, but I hear drivers are an issue. Some folks use Mac Studios, which, while quite fast, lock you into the Metal ecosystem rather than CUDA and aren't as pleasant in terms of dev or user ecosystem. Some folks build a multi-CPU server rig with a ton of vanilla RAM (this used to be popular for folks who wanted to run DeepSeek before RAM prices spiked). Some folks buy older servers with VRAM-dense but dated cards (think Pascal, Volta, etc., or AMD MI50/MI100). There's no free lunch with any of these options, honestly.

      If you don't have a very clear sense of something you can buy that you won't regret, it's hard to go wrong using any of the cloud GPU providers (Runpod, Modal, Northflank, etc.) or something like Fal or Replicate where you can try out the open source models and pay per request. Sure, you'll spend a bit more on unit costs, but it'll force you to figure out whether your workloads are well-defined enough that the pain of having them in the cloud stings enough to make you want to buy and own the metal -- if the answer is no, even if you could afford it, you'll often be happiest just using the right cloud service!

      Ask me how I figured out all of the above the hard way...


    • It depends entirely on what you want to do, and how much you're willing to deal with a hardware setup that requires a lot of configuration. Buying several 3090s can be powerful. Buying one or two 5090s can be awesome, from what I've heard.

The primary advantage of the DGX box is that it gives you access to the Nvidia ecosystem. You can develop against it almost like a mini version of the big servers you're targeting.

It's not really intended to be a great value box for running LLMs at home. Jeff Geerling talks about this in the article.

  • Exactly this. I'm not sure why people keep banging the "a Mac or Strix Halo is faster/cheaper" drum. Different market.

    If I want to do hobby/amateur AI research, play with fine-tuning models, and learn the tooling, I'm better off with the GX10 than with AMD's or Apple's systems.

    The Strix Halo machines look nice. I'd like one of those too. Especially if/when they ever get around to getting it into a compelling laptop.

    But I ordered the ASUS Ascent GX10 machine (since it was more easily available for me than the other versions of these) because I want to play around with fine-tuning open-weight models, learning the tooling, etc.

    That and I like the idea of having a (non-Apple) AArch64 Linux workstation at home.

    Now if the courier would just get their shit together and actually deliver the thing...

    • I have this device, and it's exactly as you say: this is a device for AI research and development. My buddy's Mac Ultra beats it squarely for inference workloads, but for real tinkering it can't be beat.

      I've used it to fine-tune 20+ models in the last couple of weeks. Neither a Mac nor a Strix Halo even tries to compete.
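      For anyone wondering what a run like that looks like in practice, here's a minimal LoRA sketch. The model and dataset names are placeholders, and the Hugging Face peft/trl stack is my assumption rather than this commenter's actual setup (API details also shift between trl versions):

      ```python
      # Minimal LoRA supervised fine-tune -- a smoke-test-sized sketch, not a real recipe.
      from datasets import load_dataset
      from peft import LoraConfig
      from trl import SFTConfig, SFTTrainer

      dataset = load_dataset("trl-lib/Capybara", split="train[:1%]")   # tiny slice, placeholder data

      trainer = SFTTrainer(
          model="Qwen/Qwen2.5-0.5B",                                   # placeholder open-weight model
          train_dataset=dataset,
          peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
          args=SFTConfig(output_dir="lora-out", per_device_train_batch_size=1, max_steps=50),
      )
      trainer.train()
      ```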

NVFP4 (and, to a lesser extent, MXFP8) works, in general. In terms of usable FLOPS, the DGX Spark and the GMKtec EVO-X2 both lose to the 5090, but with NCCL and OpenMPI set up, the DGX is still the nicest way to develop for our SBSA future. Working on that too; harder problem.