Comment by zozbot234 (7 days ago)

Prompt processing could be sped up with NPU inference. The Strix Halo NPU is a bit weird (it's XDNA 2, a spatial-dataflow architecture with programmable interconnects), but it's there. See https://github.com/FastFlowLM/FastFlowLM (directly supported by Lemonade: https://lemonade-server.ai/ , https://github.com/lemonade-sdk/lemonade ) for one existing project that plans to use the NPU for the prompt-processing phase. (Do note that FastFlowLM ships proprietary NPU kernels under a non-free license, so make sure that fits your needs before use.)
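
For anyone who wants to poke at it: Lemonade serves an OpenAI-compatible API, so a stock client works against it. Here's a minimal sketch; the base URL, port, and model name are assumptions for a default install, not verified values, so check your own setup.

```python
# Minimal sketch: querying a local Lemonade server through its
# OpenAI-compatible API. The base URL/port and model id below are
# assumptions -- substitute whatever your install actually reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v1",  # assumed default endpoint
    api_key="lemonade",  # local server; the key is typically ignored
)

resp = client.chat.completions.create(
    model="Llama-3.2-3B-Instruct-Hybrid",  # hypothetical model id
    messages=[{"role": "user", "content": "Explain XDNA 2 in one sentence."}],
)
print(resp.choices[0].message.content)
```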

I’ve seen this claim a lot, but I’m skeptical. Has anyone actually published benchmarks showing a big speedup from using the NPU for prefill?

AMD’s own marketing numbers put the NPU at about 50 TOPS of the platform's 126 TOPS total compute. Even if you hand-wave everything else away, that caps the theoretical upside at roughly 1.6x: 126 / (126 − 50) ≈ 1.66.

But that assumes:

1. Your workload maps cleanly onto the NPU’s 8-bit fast path.

2. There’s no overhead coordinating the iGPU + NPU (the sketch below plays with both knobs).
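
To put rough numbers on how fast that ceiling erodes, here's a back-of-the-envelope Amdahl-style sketch. The TOPS figures are the marketing numbers quoted above; the coverage and overhead parameters are made-up illustrations, not measurements.

```python
# Back-of-the-envelope ceiling for NPU-assisted prefill.
# TOPS figures are AMD's marketing numbers quoted above; the
# coverage/overhead parameters are invented for illustration.

TOTAL_TOPS = 126.0  # platform total (CPU + iGPU + NPU)
NPU_TOPS = 50.0     # NPU share of that total
BASE_TOPS = TOTAL_TOPS - NPU_TOPS  # what you have without the NPU

# Ideal case: the NPU's TOPS stack perfectly on top of everything else.
print(f"ideal ceiling: {TOTAL_TOPS / BASE_TOPS:.2f}x")  # ~1.66x

def speedup(int8_coverage: float, overhead: float) -> float:
    """Estimated speedup if only `int8_coverage` of the prefill work maps
    onto the NPU's 8-bit fast path, and coordinating the iGPU + NPU costs
    `overhead` as a fraction of the baseline runtime."""
    baseline_time = 1.0 / BASE_TOPS
    npu_time = (int8_coverage / TOTAL_TOPS           # work running at full rate
                + (1.0 - int8_coverage) / BASE_TOPS  # work stuck on iGPU/CPU
                + overhead * baseline_time)          # coordination cost
    return baseline_time / npu_time

# Hypothetical sweeps -- substitute your own guesses.
print(f"80% coverage, 5% overhead: {speedup(0.8, 0.05):.2f}x")   # ~1.37x
print(f"50% coverage, 10% overhead: {speedup(0.5, 0.10):.2f}x")  # ~1.11x
```

Even generous guesses (80% of the work on the 8-bit path, 5% coordination cost) pull the ceiling down to about 1.4x, and less friendly ones land near 1.1x.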

My expectation is the real-world gain would be close to 0, but I'd love to be proven wrong!