Comment by Havoc

1 day ago

>a faster speculator (also known as the draft model) proposes multiple tokens ahead, and the target model verifies them in parallel in a single forward pass

TIL. Bit of an aha moment - never understood till now how a big model can verify faster than it can generate

1 comment

Havoc

woadwarrior01 1 day ago

As with almost everything else in CS, it's a tradeoff. Pre-fill is compute bound, decoding is memory bandwidth bound. Speculative decoding works when the draft model is more often right that wrong, because most architectures have a lot more compute, compared to memory bandwidth.