Comment by furyofantares

18 hours ago

Models do not generate tokens. They generate probabilities for each token.

Inference parameters (temperature, top-k, and so on) determine how a token is selected from those probabilities.

You can just select the top token all the time or you can do it probabilistically.
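
As a rough sketch of that choice, here is how greedy and probabilistic selection might look in Python. The scores, the 3-token vocabulary, and the function names are made up for illustration and are not taken from any particular inference library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(probs):
    """Always select the most likely token id."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_sampled(probs, rng):
    """Select a token id at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)
rng = random.Random(0)

greedy_id = pick_greedy(probs)                                  # same id every time
sampled_ids = [pick_sampled(probs, rng) for _ in range(1000)]   # varies per draw
```

Greedy selection is deterministic, so two models agree whenever their top tokens agree. With sampling, agreement also depends on the random draws.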

How you do that in both the speculative (draft) decoding and the main inference changes how likely you are to get the exact same tokens. You can then choose to accept a draft token only if it matches exactly, or to accept it if it was reasonably likely to be chosen.

Let's say the main model picked the 2nd most likely token and the draft model picked the most likely one. You can reject that, but you get less speedup. You can accept it and get more speedup, but you do change the output: you risk the distribution of your outputs not being what you hope.

I am simplifying. In https://arxiv.org/pdf/2302.01318 they specify a probability with which a draft token is rejected.

In theory, you could do that and increase the speed at higher temperatures, but it would subtly alter your output based on the draft model's preferences. Rather than picking randomly from the main model's probabilities, you would have to accept a draft model's pick if it is close enough.

As far as I know, this is not used in practice. Currently popular implementations always match the main model output, and the draft model only affects the speed.

  • Here is the line in vLLM's source code that determines if a draft token is accepted:

        accepted = draft_prob > 0 and target_prob / draft_prob >= uniform_prob
    

    It does have a branch that checks only token id equality, which is used if temperature is 0.

    • Good analysis. That's surprising. I had always heard that the draft model doesn't affect the output in any way. It seems they do it like this to achieve faster generation. It would be interesting to investigate how this affects the output.

      Edit: I haven't gone through all the code, but they might do something like this: https://arxiv.org/abs/2211.17192 where a draft model is used and the output distribution is tweaked on rejection, resulting in the exact same distribution as the main model.
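
The accept-or-resample idea from the linked papers can be sketched as follows. This is a minimal illustration and not vLLM's code: the function names and the two example distributions are hypothetical. Only the acceptance test mirrors the line quoted above. On rejection, the replacement token is drawn from the normalized positive part of (target minus draft), which is the step that keeps the overall output distribution equal to the main model's.

```python
import random

def accept_or_resample(draft_token, target_probs, draft_probs, rng):
    """Return (token, accepted) for one position (hypothetical helper)."""
    p_t = target_probs[draft_token]
    p_d = draft_probs[draft_token]
    # Accept with probability min(1, p_t / p_d), as in the quoted line.
    if p_d > 0 and p_t / p_d >= rng.random():
        return draft_token, True
    # Rejected: resample from the residual distribution max(target - draft, 0).
    residual = [max(t - d, 0.0) for t, d in zip(target_probs, draft_probs)]
    if sum(residual) <= 0.0:
        # Degenerate case (no residual mass): fall back to the target itself.
        residual = list(target_probs)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False

def empirical_distribution(target_probs, draft_probs, trials, seed=0):
    """Draft a token, run the accept step, and tally the final tokens."""
    rng = random.Random(seed)
    counts = [0] * len(target_probs)
    for _ in range(trials):
        draft = rng.choices(range(len(draft_probs)), weights=draft_probs, k=1)[0]
        token, _ = accept_or_resample(draft, target_probs, draft_probs, rng)
        counts[token] += 1
    return [c / trials for c in counts]

target = [0.6, 0.3, 0.1]  # hypothetical main-model probabilities
draft = [0.2, 0.5, 0.3]   # hypothetical draft-model probabilities
freqs = empirical_distribution(target, draft, trials=20000)
```

Under these assumptions, `freqs` should land close to `target` rather than `draft`, even though every candidate token was proposed by the draft distribution. The draft model then changes only how often a proposal is accepted, which is what determines the speedup.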
