Comment by bawana
6 days ago
How is speculative decoding helpful if you still have to run the full model against which you check the results?
6 days ago
So at low to medium usage (small batch sizes), inference speed is memory-bandwidth bound, not compute bound: generating each token means streaming the full model's weights from memory. With speculative decoding, a small draft model "forecasts" several tokens into the future, and the full model then verifies all of those candidates in parallel in a single forward pass. That one pass reads the weights from memory roughly once for the whole batch of candidates instead of once per token, so you barely increase the memory-bandwidth pressure while using more compute. And since compute isn't the limiting resource, that extra checking is essentially free. Hope this makes sense, tried to keep it simple.
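To make the draft-then-verify control flow concrete, here's a minimal sketch of the greedy variant. `draft_next` and `target_next` are hypothetical callables standing in for the small draft model and the full model; in a real implementation the verification loop below is one batched forward pass of the full model over all drafted positions, not the per-position calls shown here.

```python
# Minimal sketch of greedy speculative decoding (assumptions: greedy
# acceptance, hypothetical next-token callables instead of real models).
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model
    target_next: Callable[[List[int]], int],  # expensive full model
    k: int = 4,
) -> List[int]:
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify: the full model checks every drafted position.
    #    In practice this is a single forward pass over all k positions
    #    (one read of the weights); here it's simulated position by position.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # full model's token replaces the miss
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All k drafts matched; the full model contributes one bonus token.
        accepted.append(target_next(ctx))
    return accepted

# Toy demo: both "models" echo a fixed sequence, so every draft token is
# accepted and each step yields k+1 tokens per full-model pass.
SEQ = [1, 2, 3, 4, 5, 6, 7, 8]
next_tok = lambda ctx: SEQ[len(ctx) % len(SEQ)]
print(speculative_step([0], next_tok, next_tok, k=4))  # [2, 3, 4, 5, 6]
```

The payoff is in step 2: whenever the draft model guesses right, several tokens come out of one full-model pass, and when it guesses wrong you still get the full model's token, so the output is identical to running the full model alone.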