Comment by daemonologist

1 hour ago

    > can it be slower than without speculative decoding in worst case then?

Yes - running the draft model costs compute and memory bandwidth, and running the drafted futures through the main model costs compute. If the draft model were really inaccurate or you're already compute-limited (usually: running large batches) you would expect some slowdown.

In practice, for single-user (non-batched) inference with a working configuration, you pretty much always get some speedup. For non-coding tasks I've seen it be nearly a wash for some people, in which case you might want to avoid it due to the extra memory usage (you'd rather use that memory to run a bigger quant/model, even at a slightly lower speed).

0 comments

daemonologist

No comments yet

Contribute on Hacker News ↗