Comment by zozbot234

1 day ago

> because LLM inference is limited by memory bandwidth, not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one

I don't think this is correct, especially for MoE models. You can save some memory bandwidth by reusing model parameters across the batch, but that's about it; it won't give each query the same speed as running it alone.
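A rough sketch of why MoE weakens the parameter-reuse argument (my own illustrative numbers; 128 experts with 8 active per token is roughly Qwen3-30B-A3B's layout, and I'm assuming routing is roughly independent across tokens):

```python
# The larger the batch, the more distinct expert weights a single decode
# step has to read, so batching shares less of the weight traffic.
E = 128   # experts per MoE layer (assumed)
k = 8     # experts activated per token (assumed)

for batch in (1, 2, 4, 8, 16, 32):
    # Expected fraction of experts hit at least once by the whole batch,
    # treating each token's routing as independent and uniform.
    frac = 1 - (1 - k / E) ** batch
    print(f"batch={batch:3d}: ~{frac:.0%} of expert weights must be read per step")
```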

Absolutely, it won't work as well for MoE, though most local models today (Qwen3-30B-A3B being the notable exception) are dense.

But even for MoE it still helps: sure, running a second agent in parallel will cut the per-query token rate by almost half, but the slowdown from each additional query shrinks rapidly, and the 30th is almost free. So if you have enough VRAM to run Qwen3-32B, you can run Qwen3-30B-A3B at roughly the same per-query speed as the 32B model, except you'll be running a hundred instances at once.
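A back-of-envelope sketch of why the marginal cost falls off for a dense model (assumed, illustrative numbers, not benchmarks): in the bandwidth-bound decode phase the weights are read once per step no matter how many queries are batched, so only the per-query KV-cache traffic grows with batch size.

```python
# Memory-bandwidth-bound decode: time per step ~ bytes read / bandwidth.
weights_gb = 16          # e.g. a ~32B dense model at 4-bit quantization (assumed)
bandwidth_gbps = 1000    # ~1 TB/s, ballpark for a high-end GPU (assumed)
kv_gb_per_query = 0.05   # rough per-query KV-cache read per step (assumed)

for batch in (1, 2, 4, 8, 32, 100):
    step_s = (weights_gb + batch * kv_gb_per_query) / bandwidth_gbps
    per_query_tps = 1 / step_s    # tokens/s each query sees
    total_tps = batch / step_s    # aggregate tokens/s across the batch
    print(f"batch={batch:3d}: {per_query_tps:6.1f} tok/s per query, "
          f"{total_tps:7.1f} tok/s total")
```

With these made-up numbers, per-query throughput only drops from ~63 to ~48 tok/s going from 1 to 100 parallel queries, while aggregate throughput scales almost linearly; for MoE the weight term itself grows with the batch (as in the sketch above), so the effect is weaker but doesn't disappear.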