Comment by zozbot234

3 hours ago

> For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.

You should run a multi-session batched decode on that DGX unless your 13 t/s decode is already running into thermal or power limits, which I don't believe it is. (To be clear, this is a real issue on Apple Silicon machines: batched decode does not seem to unlock higher aggregate tok/s unless you're specifically trying to mitigate the drawbacks of slow streamed inference. Especially on the M5 laptops, thermal/power throttling places an early limit on your total compute.

The jury is still out on Strix Halo, but I think batched decode may turn out to be quite useful there since the bandwidth bottleneck is even more constraining there.)

0 comments

zozbot234

No comments yet

Contribute on Hacker News ↗