Comment by libraryofbabel
9 hours ago
That’s correct, and yes: not less total compute on the main model (actually slightly more, since checking draft tokens that end up rejected costs you compute), but faster, because inference is memory-bandwidth bound. Like you, I also think of verification as a “mini prefill” (on top of the existing KV cache, of course); the code is very similar to prefill if you implement a simple toy version yourself.
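Here’s a minimal sketch of that verification step, assuming greedy acceptance and a hypothetical model interface (the names verify_draft and model(tokens, kv_cache) are mine, purely for illustration, not from any real library):

    import torch

    @torch.no_grad()
    def verify_draft(model, kv_cache, last_token, draft):
        # Hypothetical interface: model(tokens, kv_cache) returns logits for
        # every position in `tokens` and appends the new keys/values to
        # kv_cache in place. Running several tokens through one forward pass
        # is exactly the "mini prefill" on top of the existing cache.
        tokens = torch.tensor([last_token] + draft)
        logits = model(tokens, kv_cache)            # [len(draft) + 1, vocab]
        predicted = logits.argmax(dim=-1).tolist()  # greedy target-model picks

        # draft[i] is accepted iff the target model, given everything up to
        # and including the previous token, would have emitted draft[i] too.
        n_accepted = 0
        for i, d in enumerate(draft):
            if predicted[i] != d:
                break
            n_accepted += 1

        # Either way the target model gives us one "free" token: its own
        # prediction at the first position after the accepted prefix.
        bonus = predicted[n_accepted]
        return n_accepted, bonus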
Most of the complexity in such a toy implementation comes from getting the KV cache back into a good state for the next cycle (e.g. when only the first half of your draft tokens were accepted).
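For example, a toy rollback, assuming the cache is a per-layer list of (K, V) tensors with the sequence dimension first (an assumption on my part; real implementations use paged or batched layouts):

    def rollback_kv_cache(kv_cache, n_keep):
        # Verification appended cache entries for every draft token, rejected
        # ones included, so before the next cycle we slice back to the last
        # committed position.
        for i, (k, v) in enumerate(kv_cache):
            kv_cache[i] = (k[:n_keep], v[:n_keep])

With the sketch above, n_keep would be the old cache length plus 1 + n_accepted: the last committed token and the accepted draft tokens stay cached, while the bonus token gets fed to the model on the next cycle.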