Comment by puilp0502

5 months ago

What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

22 comments

puilp0502

jychang 5 months ago

Speculative decoding! It makes inference a LOT faster.

Instead of generating tokens one at a time, you generate the second one as well, and then use speculative decoding on that second token (instead of having it be produced by a draft model like Qwen 0.6b). If the token is checked and is correct, then the 2nd token gets generated MUCH faster.

If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.

stingraycharles 5 months ago
Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea?
I’m not an expert on LLMs, just a user.
- tomp 5 months ago
  
  No, the parent is wrong.
  Checking a token is the same as generating it.
  The benefit however is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) from previous turn, you have just generated 3 correct tokens (and another speculative). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2) so you need ti generate it again.
  
  2 replies →
- bdcs 5 months ago
  
  It relies on an “unintuitive observation”[0] that you can run batches basically for free (up to a limit). So if you only run one inference, you batch it plus a lot of guesses and, if you guess right, can speed up the inference by the number of guesses. If you guess wrong, you're back to regular speed (and still fully correct).
  [0] https://x.com/karpathy/status/1697318534555336961
- namibj 5 months ago
  
  Basically you can generate the next two tokens at once in the same matmul, and rollback to one-at-a-time when your generation said you guessed wrong (as that will mean the second of your pair you generated was generated based on revoked context).
- Zacharias030 5 months ago
  
  yes, if you know the sequence of tokens ahead of time you can verify them about as quickly as you can generate one more token because of the parallelism benefits.
  If you don’t know the future tokens though, then you can’t, and blind guessing of tokens is infeasible because the vocabulary contains circa 100k possible different tokens.
moffkalast 5 months ago
Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions?
- jychang 5 months ago
  
  Because if you generate token n+1 with all 48 layers of Qwen3-Next and 80 billion params, and also generate token n+2 with the 1 MTP layer at 2bil params... that n+2 token can be much lower quality than the n+1 token but mostly correct.
  Let's say you have a model that generates the string "The 44th president of the United States is ___ ___". Your model will generate "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2bil parameters in size). Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.
  
  2 replies →
- SonOfLilit 5 months ago
  
  If you ask me to guess an answer, I'll _usually_ produce the same answer as if I had time to think about it deeply, but not always...
  
  1 reply →
- EMM_386 5 months ago
  
  I believe it's something along these lines. The MTP head runs simultaneously and generates a probability list based on what it thinks the results will be, learned during training.
  If n+1 = "Barack" then n+2 = "Obama" (confidence: 0.90) If n+1 = "The" then n+2 = "quick" (confidence: 0.45) If n+1 = "President" then n+2 = "Biden" (confidence: 0.75)
  A threshold is set (say, as 90%) so that if the n+2 prediction is above that (as in the first example) it uses it without having to determine it with the main model. It's confident "enough".
  
  1 reply →
- eldenring 5 months ago
  
  the 2nd token is generated without knowing what token was chosen for the 1st token

cubefox 5 months ago

> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

Zacharias030 5 months ago
There is no reason that it couldn’t be beneficial for training though.
- cubefox 5 months ago
  
  Except that speculative decoding is de facto only an inference time optimization. But the H-Net architecture from the previous reference, which doesn't require tokens or speculative decoding, does something similar both for inference and training.
  
  1 reply →

rfoo 5 months ago

It could be a better draft model than separately trained EAGLE etc for speculative decoding.