Comment by pu_pe

20 hours ago

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.

  • From the linked post, it didn't read like a separate KV cache was needed:

    > The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.

    • That's great news. That has not been the case with other MTP implementations like Qwen3.5, but I see the section in the article saying Google introduced some architectural optimizations to make this possible.
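Back on the memory-overhead point, here's a rough back-of-envelope for what a separate drafter KV cache would cost if one were needed. All architecture numbers below are made up for illustration; none of them come from the article or any real Google model:

```python
# Hypothetical sub-1B drafter vs. a large target model; figures are invented
# purely to size the KV cache, not taken from the article.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per KV head, per position
    # (bytes_per_elem=2 assumes fp16/bf16).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 32_768  # 32k-token context

drafter = kv_cache_bytes(layers=12, kv_heads=4, head_dim=128, seq_len=ctx)
target = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=ctx)

print(f"drafter: {drafter / 2**30:.2f} GiB")  # ~0.75 GiB
print(f"target:  {target / 2**30:.2f} GiB")   # ~10.00 GiB
```

So even a drafter that did need its own cache would typically add only a small fraction of the target model's cache; sharing the target's cache, as the article describes, removes even that.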

Is it really no quality degradation?

I'm curious where my understanding is wrong, but I didn't think you necessarily got the exact same output with speculative decoding as I understand it. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model would produce, they're accepted.

I thought a drafted token doesn't necessarily have to be the exact token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot), just one the larger model could have produced with whatever top-k and temperature settings are in use.
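For what it's worth, the acceptance rule in the original speculative decoding papers isn't "within the top few tokens"; it's a rejection-sampling step that provably preserves the target model's output distribution. A minimal sketch with toy probability vectors (not any particular implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(x, p_target, q_draft):
    """One acceptance step for a drafted token x.

    p_target, q_draft: the big and draft models' next-token probability
    vectors at this position. Accept x with probability
    min(1, p_target[x] / q_draft[x]); on rejection, resample from the
    residual distribution max(p - q, 0), renormalized. Either way, the
    emitted token is distributed exactly according to p_target.
    """
    if rng.random() < min(1.0, p_target[x] / max(q_draft[x], 1e-12)):
        return x, True
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False
```

So a drafted token that merely looks "good enough" can still be rejected; the guarantee is that whatever gets emitted is a sample from the big model's distribution (under whatever temperature/top-k you'd apply to it), just produced faster whenever the drafter guesses well.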

  • It really is. This is because LLMs with a single output/user are strongly bandwidth limited. Although the hardware can generate multiple tokens simultaneously, it is slowed down if the tokens depend on each other, as is the case with regular text generation.

    The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.

    A poor draft model will simply slow down the process without affecting the output.

    • > If the guess is right

      This is the crux. What makes the guess "right"?

I think the acceptance criterion is not that the token is exactly the token the big model would have produced. It's accepted if the big model verifies that the probability of that token was high enough.

How close it is to the same output (or same distribution of outputs) you'd get from running the big model would depend on temperature, top-k, top-p, and other inference parameters.

      7 replies →

  • Speculative decoding batches multiple completions covering all possible outcomes (0/1/2 draft tokens accepted) and checks whether the big model deviates at any point, thus verifying each token (rough sketch below). So there's no difference in output.
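A minimal sketch of that draft-then-verify step, using greedy decoding to keep it short; `next_token_logits` is a hypothetical model API standing in for one batched forward pass (with sampling you'd use the accept/resample rule sketched above):

```python
import numpy as np

def speculative_step_greedy(target, drafter, tokens, k=4):
    """One draft-then-verify step under greedy decoding.

    `model.next_token_logits(seq)` is assumed to return one row of logits per
    position of `seq` in a single forward pass (hypothetical API).
    """
    # 1. The small model drafts k tokens ahead, one at a time (cheap).
    draft = list(tokens)
    for _ in range(k):
        draft.append(int(np.argmax(drafter.next_token_logits(draft)[-1])))

    # 2. The big model scores the whole drafted sequence in ONE forward pass,
    #    i.e. it computes its own next-token choice at every drafted position.
    big_choices = np.argmax(target.next_token_logits(draft[:-1]), axis=-1)

    # 3. Keep drafted tokens until the big model's choice deviates; at the
    #    first mismatch, keep the big model's token and discard the rest.
    out = []
    for i in range(k):
        big_tok = int(big_choices[len(tokens) - 1 + i])
        out.append(big_tok)
        if big_tok != draft[len(tokens) + i]:
            break
    return out
```

Every emitted token is one the big model itself would have chosen given the accepted prefix, which is why the output doesn't change; the drafter only controls how many of those tokens you get per big-model pass.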

It's based on taking advantage of spare compute if you have it. A tiny model generates a few tokens ahead first, then the large one runs batched inference on all of them at once, as if it were already at each of those points in time. If they all check out afterwards it jumps ahead; otherwise it discards the rest and continues from the last verified token.

Not sure about this implementation, but conceptually it only works well on very capable GPUs for very predictable output. Typical speedup is about 30%; not sure how Google is claiming 250%, which seems ridiculous.

And if you don't have enough compute, then you get negative speedup from all the extra overhead.
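For context on what's even achievable: under the usual simplifying assumption of an i.i.d. per-token acceptance rate alpha and draft length gamma, the standard analysis gives E[tokens per big-model pass] = (1 - alpha^(gamma+1)) / (1 - alpha). A toy calculation, ignoring the drafter's own cost and verification overhead:

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per big-model forward pass, assuming each of
    the gamma drafted tokens is accepted independently with probability alpha
    (plus the big model's own token at the first rejection)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, gamma=4):.2f}x")
# alpha=0.5: 1.94x   alpha=0.7: 2.77x   alpha=0.9: 4.10x  (upper bounds)
```

So whether a given setup lands nearer 1.3x or 2.5x mostly comes down to acceptance rate and draft length, minus whatever the drafter and verification overhead eat.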