Comment by cubefox
2 days ago
> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
It is only useful for inference and doesn't help with pretraining. That actually points to speculative decoding not being sufficiently general: the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...
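For context, the loop that speculative decoding adds sits entirely at decode time. A minimal sketch (greedy drafting and greedy verification for simplicity, not the full probabilistic acceptance rule; `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model):

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # token sequence -> greedy next token

def speculative_decode(prompt: List[Token],
                       draft_next: NextTokenFn,
                       target_next: NextTokenFn,
                       k: int = 4,
                       max_new: int = 32) -> List[Token]:
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        # 1) Draft k tokens cheaply with the small model.
        drafted, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2) Verify with the target model: keep the longest prefix it agrees
        #    with, then take the target's own token at the first mismatch.
        accepted, ctx = [], list(out)
        for t in drafted:
            t_target = target_next(ctx)
            if t_target == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(t_target)  # target's correction ends the round
                break
        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

In practice the target model scores all k drafted tokens in a single batched forward pass, which is where the speedup comes from. Nothing in this loop touches the training objective, which is the sense in which it is purely an inference-time trick.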
There is no reason that it couldn’t be beneficial for training though.
Except that speculative decoding is de facto only an inference-time optimization. The H-Net architecture linked above, which doesn't require tokens or speculative decoding, does something similar for both inference and training.
Yes, but the discussion is about Multi-Token Prediction (Gloeckle et al. 2024), which is primarily a training objective and only incidentally useful for speculative decoding.
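For concreteness, a rough sketch of that training objective in the spirit of Gloeckle et al. 2024 (a shared trunk with k extra output heads, head i trained to predict the token i positions ahead; module and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab: int, k: int = 4):
        super().__init__()
        self.trunk = trunk  # assumed: causal encoder mapping (B, T) ids -> (B, T, d_model)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))
        self.k = k

    def loss(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) integer ids
        h = self.trunk(tokens)  # (B, T, d_model)
        total = torch.zeros((), device=tokens.device)
        for i, head in enumerate(self.heads, start=1):
            # Head i predicts the token i steps ahead: drop the last i hidden
            # states and the first i target tokens so positions line up.
            logits = head(h[:, :-i, :])        # (B, T-i, vocab)
            target = tokens[:, i:]             # (B, T-i)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1))
        return total / self.k
```

At inference you can drop heads 2..k and decode normally, or use their drafts for self-speculative decoding, which is the "incidental" part.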