← Back to context

Comment by samuelknight

6 hours ago

Some of these comments miss the advantage of diffusion. This is will have a big impact on edge devices, such as your phone or the GPU in your computer.

An LLM's decoder computes tokens one-at-a-time because attention has to account for each previous token. The existing LLM decoders scale well when you have enough load to batch many inferences together. Diffusion of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because the consumer RAM like LPDDRx/GDDRx is lower bandwidth than HBM, and the requests are serial so you can't batch compute common weights. Diffusion can compute tokens in parallel which relieves the memory bandwidth bottle neck.

Edge devices don't just have limited memory bandwidth though, they also have very limited compute. To the extent where you don't actually need all that much batching to saturate their viable compute and run into obvious thermal/power limits. (It's just not true that "requests are inherently serial" in edge inference; any time you have multiple requests (i.e. "chats") in flight, batching becomes applicable if you have enough memory capacity for the KV caches.) I'm not sure how diffusion models are supposed to help there, if they simply take more compute for lower-quality outcomes and a dubious saving in memory bandwidth.

You’re mostly right but conflating attention with autoregressive/causal which is the real issue that prevents you from using more compute

You can use diffusion with attention, and this model does in fact use attention