Comment by cubefox
2 days ago
> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
It is only useful for inference and doesn't help with pretraining. That actually points to speculative decoding not being sufficiently general: the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...
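For context, the loop that speculative decoding adds sits entirely at decode time. A minimal sketch (greedy drafting and greedy verification for simplicity, not the full probabilistic acceptance rule; `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model):

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # token sequence -> greedy next token

def speculative_decode(prompt: List[Token],
                       draft_next: NextTokenFn,
                       target_next: NextTokenFn,
                       k: int = 4,
                       max_new: int = 32) -> List[Token]:
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        # 1) Draft k tokens cheaply with the small model.
        drafted, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2) Verify with the target model: keep the longest prefix it agrees
        #    with, then take the target's own token at the first mismatch.
        accepted, ctx = [], list(out)
        for t in drafted:
            t_target = target_next(ctx)
            if t_target == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(t_target)  # target's correction ends the round
                break
        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

In practice the target model scores all k drafted tokens in a single batched forward pass, which is where the speedup comes from. Nothing in this loop touches the training objective, which is the sense in which it is purely an inference-time trick.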
There is no reason that it couldn’t be beneficial for training though.
Except that speculative decoding is de facto only an inference-time optimization. The H-Net architecture linked above, which doesn't require tokens or speculative decoding, does something similar for both inference and training.
Yes, but the discussion is about Multi-Token Prediction (Gloeckle et al. 2024), which is primarily a training objective and only incidentally useful for speculative decoding.
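For concreteness, a rough sketch of that training objective in the spirit of Gloeckle et al. 2024 (a shared trunk with k extra output heads, head i trained to predict the token i positions ahead; module and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab: int, k: int = 4):
        super().__init__()
        self.trunk = trunk  # assumed: causal encoder mapping (B, T) ids -> (B, T, d_model)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))
        self.k = k

    def loss(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) integer ids
        h = self.trunk(tokens)  # (B, T, d_model)
        total = torch.zeros((), device=tokens.device)
        for i, head in enumerate(self.heads, start=1):
            # Head i predicts the token i steps ahead: drop the last i hidden
            # states and the first i target tokens so positions line up.
            logits = head(h[:, :-i, :])        # (B, T-i, vocab)
            target = tokens[:, i:]             # (B, T-i)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1))
        return total / self.k
```

At inference you can drop heads 2..k and decode normally, or use their drafts for self-speculative decoding, which is the "incidental" part.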