Comment by zozbot234
20 hours ago
> because LLM inference is limited by memory bandwidth not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one
I don't think this is correct, especially given MoE. Batching lets you amortize one read of the model weights across all the queries in the batch, which saves memory bandwidth, but that's about it. With MoE it helps even less: different tokens in a batch route to different experts, so the batch ends up reading more parameters per step and the reuse benefit shrinks. It's not giving you the same speed as a single query. Rough sketch below.
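A back-of-envelope sketch of the point (my own assumptions, not the parent's numbers: ~1 TB/s of memory bandwidth, a 14 GB dense model vs. an 8-expert/2-active MoE of the same total size with all parameters in the experts and uniform routing, ignoring KV-cache reads and the compute ceiling):

```python
# Bandwidth-bound decode speed, dense vs. MoE, as batch size grows.
# All constants below are illustrative assumptions, not measurements.

BANDWIDTH_GBS = 1000        # assumed GPU memory bandwidth, GB/s
DENSE_PARAMS_GB = 14        # assumed dense model size (e.g. 7B @ fp16), GB
N_EXPERTS, ACTIVE_PER_TOKEN = 8, 2
EXPERT_GB = DENSE_PARAMS_GB / N_EXPERTS  # crude: all params live in experts

def dense_tok_per_query(batch):
    # One full read of the weights serves the whole batch, so per-query
    # speed stays flat as the batch grows (until compute becomes the limit).
    return BANDWIDTH_GBS / DENSE_PARAMS_GB

def moe_tok_per_query(batch):
    # Expected number of distinct experts touched by `batch` tokens,
    # assuming each token picks ACTIVE_PER_TOKEN experts uniformly.
    p_untouched = (1 - ACTIVE_PER_TOKEN / N_EXPERTS) ** batch
    experts_read = N_EXPERTS * (1 - p_untouched)
    # More experts touched means more bytes read per decode step.
    return BANDWIDTH_GBS / (experts_read * EXPERT_GB)

for b in (1, 4, 16):
    print(f"batch {b:>2}: dense {dense_tok_per_query(b):6.1f} tok/s/query, "
          f"MoE {moe_tok_per_query(b):6.1f} tok/s/query")
```

Under these assumptions the dense model holds ~71 tok/s per query at every batch size, while the MoE drops from ~286 tok/s per query at batch 1 toward that same dense floor as the batch touches nearly every expert.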