Comment by naasking
3 hours ago
It depends on the type of MTP (multi-token prediction). If you're using two models, draft + full, then arguably yes: if you really are seeing 100% acceptance rates, the larger model isn't providing much benefit, because the draft model is already producing every token the full model would have picked. There are other forms of speculative decoding that work within the larger model by itself, though. E.g., Qwen has additional speculative decoding attention heads, so there is no secondary drafting model.
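
To make the two-model case concrete, here's a rough sketch of greedy draft + full speculative decoding. Everything in it is an assumption for illustration: the names (`speculative_decode`, `target`, `draft_same`, `draft_weak`) are made up, and the "models" are toy functions from a token prefix to the next token, not real LLMs. A real system would verify all drafted tokens in one batched forward pass of the full model and usually samples rather than decoding greedily.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding with a single model."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns (tokens, acceptance rate)."""
    tokens = list(prompt)
    proposed = accepted = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        guess: List[int] = []
        for _ in range(k):
            guess.append(draft(tokens + guess))
        proposed += k
        # 2. The full model checks each proposed position in order.
        for i, g in enumerate(guess):
            t = target(tokens + guess[:i])
            if t == g:
                accepted += 1
                tokens.append(g)   # draft token accepted
            else:
                tokens.append(t)   # first mismatch: keep the full model's token
                break              # discard the rest of the draft
    return tokens[:len(prompt) + n_new], accepted / proposed


# Toy models over tokens 0..6 (purely illustrative).
target: Model = lambda p: (p[-1] * 3 + 1) % 7
draft_same: Model = target                     # always agrees with the full model
draft_weak: Model = lambda p: (p[-1] + 1) % 7  # agrees only occasionally

reference = greedy_decode(target, [1], 12)
out_same, rate_same = speculative_decode(target, draft_same, [1], 12)
out_weak, rate_weak = speculative_decode(target, draft_weak, [1], 12)
```

In this sketch the output always matches what the full model would have generated on its own; the draft only changes how many full-model checks are wasted. When the acceptance rate is 1.0, the draft model alone would have produced the same tokens, which is the point above about the larger model not adding much.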