
Comment by tomp

20 hours ago

No, the parent is wrong.

Checking a token is the same as generating it.

The benefit, however, is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) guess from the previous turn, you have just generated 3 correct tokens (and another speculative one). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again.
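A rough sketch of that accept/reject loop, in case it helps. Everything here is a toy assumption rather than any real model's API: tokens are small integers, `target_next` stands in for the “real” prediction, and `draft_next` stands in for the MTP guess (made deliberately wrong at every third position). In a real model, the two `target_next` calls inside one loop iteration would come from a single batched forward pass, which is why they count as one pass.

```python
def target_next(prefix):
    """Hypothetical 'real' model: a deterministic toy rule over integer tokens."""
    return (sum(prefix) * 31 + len(prefix)) % 10


def draft_next(prefix):
    """Hypothetical MTP-style guess: matches the real model except when
    the prefix length is a multiple of 3, where it is deliberately wrong."""
    guess = target_next(prefix)
    return (guess + 1) % 10 if len(prefix) % 3 == 0 else guess


def greedy_decode(prompt, n_new):
    """Baseline: one forward pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens, n_new


def speculative_decode(prompt, n_new):
    """One pass yields the real token for the guessed position and, if the
    guess was right, the real token that follows it as well."""
    tokens = list(prompt)
    guess = None  # speculative token carried over from the previous pass
    passes = 0
    while len(tokens) < len(prompt) + n_new:
        passes += 1
        real = target_next(tokens)  # "real" prediction for the guessed slot
        tokens.append(real)
        if guess == real:
            # Guess was correct, so the token computed on top of it is valid too.
            tokens.append(target_next(tokens))
        # Otherwise the token that followed the wrong guess is simply discarded.
        guess = draft_next(tokens)  # new speculative token for the next pass
    return tokens[:len(prompt) + n_new], passes


plain, plain_passes = greedy_decode([1, 2, 3], 12)
spec, spec_passes = speculative_decode([1, 2, 3], 12)
print(spec == plain, plain_passes, spec_passes)
```

The output is identical to plain greedy decoding because every kept token comes from `target_next`; the guess only decides how many tokens a single pass gets to keep.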

Thanks for the clarification. Your comment made me see the similarity (in spirit) between Speculative Decoding and Speculative Execution [1] in CPUs. Very cool and clever optimization strategy for LLMs, IMHO.

[1] https://en.wikipedia.org/wiki/Speculative_execution

Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is.