Comment by littlestymaar
1 day ago
That this kind of approach works is good news for local LLM enthusiasts: it makes cloud LLM usage more expensive, while a local LLM can do it essentially for free up to a point (because LLM inference is limited by memory bandwidth, not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one, until you become compute-bound of course).
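Rough sketch if you want to see this for yourself (assumes PyTorch with a CUDA GPU; the matrix sizes are just illustrative, not any real model's): time a decode-style matmul against the same weight matrix at batch 1 and at batch 32. Because the dominant cost is streaming the weights from VRAM, the batched version is nowhere near 32x slower.

```python
# Sketch: a decode-step matmul is memory-bound, so batching is nearly free.
# Assumes PyTorch with CUDA; sizes are illustrative only.
import torch

torch.manual_seed(0)
d_model, d_ff = 8192, 28672          # one large FFN weight, ~0.5 GB in fp16
W = torch.randn(d_ff, d_model, dtype=torch.float16, device="cuda")

def time_batch(batch_size, iters=50):
    x = torch.randn(batch_size, d_model, dtype=torch.float16, device="cuda")
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        y = x @ W.T                  # same weights streamed from VRAM each step
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per step

for b in (1, 8, 32):
    print(f"batch={b:3d}: {time_batch(b):.3f} ms per step")
# Typical result: batch 32 costs only slightly more than batch 1, because
# reading W from VRAM dominates, until you eventually become compute-bound.
```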
> because LLM inference is limited by memory bandwidth, not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one
I don't think this is correct, especially given MoE. You can save some memory bandwidth by reusing model parameters, but that's about it. It's not giving you the same speed as a single query.
Absolutely, it's not going to work that well for MoE, though today most local models (except Qwen3-30B-A3B) are dense ones.
But even for MoE it will still work: sure, the second parallel query will cut the token rate almost in half, but the marginal slowdown shrinks with each additional query and the 30th will be almost free. So if you have enough VRAM to run Qwen3-32B, you can run Qwen3-30B-A3B at the same speed as the 32B version, but you'll be running a hundred instances.
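Back-of-the-envelope sketch of why (assuming uniform, independent expert routing, which real routers don't exactly do; the 128 experts / 8 active numbers are Qwen3-30B-A3B-ish but treat them as illustrative): the expected number of distinct experts a batch touches per decode step grows sublinearly with batch size, so memory traffic, and hence per-query cost, grows slower than the batch.

```python
# Sketch: expected distinct experts read per decode step for a batch of
# independent queries. Assumes uniform, independent expert routing, which
# is a simplification of how real MoE routers behave.
num_experts = 128       # illustrative; roughly Qwen3-30B-A3B's expert count
active_per_token = 8    # experts activated per token

def expected_unique_experts(batch_size):
    # P(a given expert is untouched by one query) = 1 - active/num_experts
    p_untouched = (1 - active_per_token / num_experts) ** batch_size
    return num_experts * (1 - p_untouched)

for b in (1, 2, 4, 8, 16, 32):
    u = expected_unique_experts(b)
    print(f"batch={b:3d}: ~{u:6.1f} distinct experts read "
          f"({u / (b * active_per_token):.2f}x of the naive b*8)")
# Memory traffic per step grows much more slowly than the batch size,
# so each additional parallel query costs less than the previous one.
```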
Wait, how does this work? If you load in one 40 GB LLM, then loading four more 40 GB LLMs still takes an extra 160 GB of memory, right?
It will typically be the same 40 GB model loaded once, but called with many different inputs simultaneously.
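For example, with Hugging Face transformers (just one batched runtime among many; the model name here is a placeholder), the weights are loaded once and the parallel queries simply become the batch dimension of the input:

```python
# Sketch: one copy of the weights in VRAM, many prompts in one batch.
# Assumes the transformers library; the model name is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-32B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padded batches
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)  # weights loaded once, regardless of how many prompts you batch

prompts = [
    "Summarize the benefits of batched inference.",
    "Write a haiku about memory bandwidth.",
    "Explain what a mixture-of-experts model is.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)  # each decode step
                                                       # serves all prompts
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text, "\n---")
```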