Comment by tomp
3 days ago
No, the parent is wrong.
Checking a token costs the same as generating it.
The benefit, however, is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating tokens 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) guess from the previous turn, you have just generated 3 correct tokens (plus another speculative one). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again.
To clarify, I should have stated: "Instead of generating tokens one at a time, you generate the second one as well WITH MTP, and then use speculative decoding on that second token (instead of having the second token be produced by a draft model like Qwen 0.6b). If the FIRST MTP token is checked and is correct, then the second token gets generated MUCH faster."
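In toy form, the verify-and-extend loop described above can be sketched like this (a sketch under assumptions: `model_step` is a hypothetical stand-in that returns the real next token plus one MTP guess, with the guess matching at a fixed hit rate; in a real model the verification comes from logits of the same forward pass, not a second call):

```python
import random

random.seed(0)
HIT_RATE = 0.7   # assumed probability that an MTP guess matches the real token
VOCAB = range(100)

def model_step(prefix):
    """Hypothetical stand-in for one forward pass: returns the real next
    token and an MTP guess for the token after it."""
    real = random.choice(VOCAB)
    guess = real if random.random() < HIT_RATE else random.choice(VOCAB)
    return real, guess

def generate(n_tokens):
    out, passes = [], 0
    real, guess = model_step(out)   # pass 1: token 1 + MTP guess for token 2
    out.append(real)
    passes += 1
    while len(out) < n_tokens:
        passes += 1
        # One pass over prefix + [guess]: yields the real token for the
        # guessed slot, verifying last turn's MTP prediction.
        real, next_guess = model_step(out)
        out.append(real)
        if real == guess:
            # Guess verified: that same pass already ran on the position
            # after it, so a second token comes "for free" (simulated here
            # by a second call).
            bonus, next_guess = model_step(out)
            out.append(bonus)
        # On a miss, the corrected token replaces the guess, and the token
        # after it must be regenerated next turn.
        guess = next_guess
    return out, passes

tokens, passes = generate(1000)
print(len(tokens), passes)
```

With a 0.7 hit rate this yields roughly 1.7 committed tokens per forward pass instead of 1, which is the whole payoff: verification is free, so every hit turns one pass into two tokens.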
Thanks for the clarification. Your comment made me see the similarity (in spirit) between Speculative Decoding and Speculative Execution [1] in CPUs. A very cool and clever optimization strategy for LLMs, IMHO.
[1] https://en.wikipedia.org/wiki/Speculative_execution
Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is.
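A rough back-of-the-envelope answer (my own simplification, assuming each drafted position independently matches the real token with probability p, and acceptance stops at the first mismatch): drafting d tokens ahead buys an expected p + p² + … + p^d extra tokens per pass, so the marginal gain shrinks geometrically and is capped at p/(1−p) no matter how deep you draft:

```python
def expected_extra_tokens(p, d):
    # Expected accepted speculative tokens per pass under the independence
    # assumption: the k-th drafted token is kept only if the first k all match.
    return sum(p ** k for k in range(1, d + 1))

# Marginal value of drafting deeper at an assumed 0.7 per-token hit rate.
for d in (1, 2, 4, 8):
    print(d, round(expected_extra_tokens(0.7, d), 3))
```

At p = 0.7 the second drafted token is worth 0.49 extra tokens, the fourth only ~0.24, and the whole series can never exceed 0.7/0.3 ≈ 2.33 — so the drop-off is steep, and real hit rates also degrade with depth since errors compound.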