
Comment by zozbot234

10 hours ago

> But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources.

Right, this is the same mechanism that makes batching work. It's "free" until you exhaust the available compute, at which point decode throughput becomes compute bound. (That's a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have only one or a few users, i.e. when compute is abundant. Once you're running large batches, you're better off spending that compute on growing the batch instead.
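A quick back-of-envelope makes the crossover concrete. The numbers below are illustrative assumptions (roughly H100-class bandwidth and FLOPs, a 70B dense model in bf16), and KV-cache traffic is ignored; real configurations move the crossover batch size around, but the shape of the argument is the same: below the crossover, weight traffic dominates each step and compute sits idle, which is the headroom MTP/speculative decoding can spend.

```python
# Back-of-envelope roofline for autoregressive decode.
# All numbers are illustrative assumptions, not measurements.
PEAK_FLOPS      = 1.0e15     # assumed accelerator peak, FLOP/s
PEAK_BW         = 3.35e12    # assumed HBM bandwidth, bytes/s
PARAMS          = 70e9       # assumed dense parameter count
BYTES_PER_PARAM = 2          # bf16 weights

def decode_step(batch):
    # Weights are streamed once per step regardless of batch size;
    # matmul work grows linearly with the batch (~2 FLOPs per weight per sequence).
    t_mem  = PARAMS * BYTES_PER_PARAM / PEAK_BW
    t_comp = 2 * PARAMS * batch / PEAK_FLOPS
    return max(t_mem, t_comp), ("compute" if t_comp > t_mem else "memory")

for b in (1, 8, 64, 256, 1024):
    t, bound = decode_step(b)
    print(f"batch {b:5d}: {t*1e3:7.2f} ms/step, {bound}-bound, {b/t:10,.0f} tok/s aggregate")
```

With these assumptions the crossover lands at a batch size of around 300; below that, extra speculation is nearly free, and above it, it competes with batching for the same FLOPs.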

Of course, batch size is usually limited by things like bulky KV caches, so perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can push the batch size a lot further.
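To put rough numbers on the KV-cache point (the shapes below are assumptions for illustration, not the published config of any particular model), a compressed MLA-style cache lets several times as many sequences share the same HBM headroom, which is exactly the extra batch size being described:

```python
# Rough KV-cache sizing.  Layer counts, head shapes, and the MLA-style latent
# width are illustrative assumptions.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # keys + values, per layer

gqa_like = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # conventional GQA cache
mla_like = 80 * 576 * 2                                             # one ~576-wide compressed latent per layer

for name, per_tok in (("GQA-like", gqa_like), ("MLA-like", mla_like)):
    per_seq_gb = per_tok * 32_000 / 1e9    # one 32k-token context
    fits = 100 / per_seq_gb                # sequences fitting in ~100 GB of spare HBM
    print(f"{name}: {per_tok/1024:5.1f} KiB/token, {per_seq_gb:5.2f} GB per 32k-token seq, "
          f"~{fits:.0f} seqs in 100 GB")
```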

You can disaggregate, though: the draft model can run on cheaper hardware with less RAM, freeing up time on the more expensive, high-memory machines.
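For concreteness, a minimal sketch of that split, assuming ordinary draft/verify speculative decoding: the two model callables here are toy stand-ins (random distributions over a tiny vocabulary), and in a disaggregated setup the draft would live on the cheap box while the target runs on the expensive one. Only the propose/verify control flow is the point.

```python
import numpy as np

VOCAB, K = 50, 4
rng = np.random.default_rng(0)

def draft_probs(ctx):    # stand-in for the small draft model (cheap hardware)
    p = rng.random(VOCAB)
    return p / p.sum()

def target_probs(ctx):   # stand-in for the big target model (expensive hardware)
    p = rng.random(VOCAB)
    return p / p.sum()

def speculative_step(ctx):
    # 1) Draft proposes K tokens autoregressively.
    proposals, q = [], []
    for _ in range(K):
        qi = draft_probs(ctx + proposals)
        proposals.append(int(rng.choice(VOCAB, p=qi)))
        q.append(qi)
    # 2) Target scores every proposed position; in practice this is one
    #    batched forward pass, written per position here for clarity.
    p = [target_probs(ctx + proposals[:i]) for i in range(K)]
    # 3) Standard accept/reject: keep a prefix, then sample one corrected token.
    out = []
    for i, tok in enumerate(proposals):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            resid = np.maximum(p[i] - q[i], 0.0)
            resid /= resid.sum()
            out.append(int(rng.choice(VOCAB, p=resid)))
            return out           # rejection: stop after the corrected token
    return out                   # all K accepted (a bonus token could also be sampled)

print(speculative_step(ctx=[1, 2, 3]))
```

The expensive machine's work per accepted token shrinks when the draft guesses well, while the cheap machine absorbs the serial, low-arithmetic-intensity proposing.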

I think it also gets used in the /fast modes that providers sell at a higher price.

  • They probably use it on all models. "Fast" is probably just a resource pool with less congestion, and therefore higher per-user throughput but lower efficiency.