Comment by firefly2000
2 days ago
If the workload were perfectly parallelizable, your claim would be true. However, if it has serial dependency chains, it is absolutely worth computing it quickly but unreliably and then verifying in parallel.
This is exactly what speculative decoding for LLMs does, and it can yield a nice boost.
A small, hence fast, model predicts the next tokens serially. Then a batch of tokens is validated by the main model in parallel. If there is a mismatch, you reject the speculated token at that position and all subsequent speculated tokens, take the correct token from the main model, and restart speculation from there.
If the predictions are good and the batch parallelism efficiency is high, you can get a significant boost.
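A minimal sketch of that loop, in the greedy case (the sampled case needs the acceptance rule discussed further down the thread). `main_model` and `draft_model` are hypothetical callables that return a greedy next-token prediction for a token sequence; the names are illustrative, not any real API:

```python
def speculative_decode_greedy(main_model, draft_model, prompt, n_tokens, k=4):
    """Generate n_tokens greedily, speculating k tokens per verification round."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_tokens:
        base = len(tokens)

        # 1) The small model speculates k tokens serially (cheap).
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_model(draft))

        # 2) The main model verifies all k positions; in a real system this
        #    is a single batched forward pass, not k separate calls.
        verified = [main_model(draft[: base + i]) for i in range(k)]

        # 3) Accept speculated tokens up to the first mismatch. The main
        #    model's token is correct either way, so it is always kept.
        for i in range(k):
            tokens.append(verified[i])
            if verified[i] != draft[base + i]:
                break  # reject this and all subsequent speculated tokens
    return tokens[len(prompt): len(prompt) + n_tokens]
```

In the best case each round costs roughly one main-model step but advances k tokens, which is where the speedup comes from.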
I have a question about what "validation" means exactly. Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft? Wondering if there is a better method that preserves the distribution of the main model.
> Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft?
It does the generation as normal using the draft model, thus sampling from the draft model's distribution for a given prefix to get the next (speculated) token. But it then uses the draft model's distribution and the main model's distribution for the given prefix to probabilistically accept or reject the speculated token, in a way which guarantees the distribution used to sample each token is identical to that of the main model.
The paper has the details[1] in section 2.3.
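Concretely, the rule in section 2.3 is: accept the draft token x with probability min(1, p(x)/q(x)), where p and q are the main and draft models' next-token distributions; on rejection, resample from the normalized residual max(0, p - q). A sketch, assuming p and q are numpy arrays over the vocabulary and x was sampled from q:

```python
import numpy as np

def accept_or_resample(p, q, x, rng=np.random.default_rng()):
    # Accept the draft token x with probability min(1, p[x] / q[x]).
    # q[x] > 0 is guaranteed since x was sampled from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual distribution
    # norm(max(0, p - q)).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False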
The inspiration for the method was indeed speculative execution as found in CPUs.
[1]: https://arxiv.org/abs/2211.17192 Fast Inference from Transformers via Speculative Decoding