Comment by woadwarrior01
1 day ago
As with almost everything else in CS, it's a tradeoff. Pre-fill is compute-bound; decoding is memory-bandwidth-bound. Speculative decoding works when the draft model is right more often than wrong, because most hardware architectures have far more compute than memory bandwidth, so verifying several drafted tokens in one pass of the big model costs little more than decoding a single token.
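
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch, not a benchmark. It assumes each drafted token is accepted independently with probability `alpha`, that the draft model proposes `k` tokens per step, and that verifying those `k` tokens costs about the same as one ordinary decode step of the big model (the memory-bandwidth-bound assumption). The function names and the `draft_cost` parameter are made up for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target model always contributes one
    token itself (the correction or the bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token accepted, plus the target model's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost is a hypothetical parameter: the time of one draft-model
    step relative to one target-model decode step. Verification of k
    tokens is assumed to cost one target-model step.
    """
    time_per_step = k * draft_cost + 1.0
    return expected_tokens_per_step(alpha, k) / time_per_step


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.8):
        print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

Under these assumptions the win grows quickly with the acceptance rate, and a draft model that is slow relative to the target can erase it even when acceptance is decent.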