Comment by zozbot234
3 hours ago
Edge devices don't just have limited memory bandwidth, though; they also have very limited compute, to the extent that you don't actually need all that much batching to saturate their available compute and run into obvious thermal/power limits. (It's just not true that "requests are inherently serial" in edge inference: any time you have multiple requests, i.e. "chats", in flight, batching becomes applicable, provided you have enough memory capacity for the KV caches.) I'm not sure how diffusion models are supposed to help there if they simply take more compute for lower-quality outcomes and a dubious saving in memory bandwidth.
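To put rough numbers on the batching point, here is a back-of-the-envelope roofline sketch. It assumes decode reads every weight once per step regardless of batch size and spends about 2 FLOPs per parameter per sequence, and it ignores KV cache traffic entirely. The function names and the hardware figures (2 TFLOP/s, 100 GB/s, 4-bit weights, 3B parameters) are hypothetical placeholders, not measurements of any real device.

```python
def crossover_batch(peak_flops, mem_bandwidth, bytes_per_param):
    """Batch size at which a decode step stops being memory-bound.

    Assumes weights are read once per step (shared across the batch)
    and each sequence costs ~2 FLOPs per parameter. KV cache reads are
    ignored, so this is only a rough estimate.
    """
    return peak_flops * bytes_per_param / (2.0 * mem_bandwidth)


def step_time(params, batch, peak_flops, mem_bandwidth, bytes_per_param):
    """Estimated seconds per decode step under the same simple roofline model."""
    memory_time = params * bytes_per_param / mem_bandwidth   # stream the weights
    compute_time = 2.0 * params * batch / peak_flops         # matmul work for the batch
    return max(memory_time, compute_time)


# Hypothetical edge device: 2 TFLOP/s, 100 GB/s, 4-bit (0.5 byte) weights.
PEAK_FLOPS = 2e12
MEM_BANDWIDTH = 100e9
BYTES_PER_PARAM = 0.5
PARAMS = 3e9  # hypothetical 3B-parameter model

if __name__ == "__main__":
    print("crossover batch:", crossover_batch(PEAK_FLOPS, MEM_BANDWIDTH, BYTES_PER_PARAM))
    for batch in (1, 4, 10):
        t = step_time(PARAMS, batch, PEAK_FLOPS, MEM_BANDWIDTH, BYTES_PER_PARAM)
        print(f"batch={batch}: {t * 1e3:.1f} ms per step")
```

Under these made-up figures the step time stays flat for the first few concurrent requests and starts growing once the batch passes the crossover, which is the sense in which a handful of in-flight chats can already saturate compute.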