Comment by jasonjmcghee
10 months ago
Importantly, they note that using a draft model screws it up, and this was my experience: I was initially impressed, then started seeing problems, but after disabling my draft model it started working much better. Very cool stuff; it's fast too, as you note.
The /think and /no_think commands are very convenient.
That should not be the case. Speculative decoding trades extra compute for memory bandwidth: the output is guaranteed to be the same with or without it (exactly the same tokens under greedy decoding, the same distribution under sampling). Perhaps there's a bug in the implementation you're using.
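For intuition, here's a minimal sketch of the greedy case in Python, with toy stand-in models. The helper names are hypothetical, and a real implementation verifies all drafted positions in a single batched forward pass of the target model; this loop scores them one at a time for clarity. The point is that every emitted token is one the target model itself chose, so the draft model can only change speed, never output:

```python
import numpy as np

VOCAB = 16

def make_model(seed: int):
    """Toy 'language model': next-token logits depend only on the last token."""
    W = np.random.default_rng(seed).standard_normal((VOCAB, VOCAB))
    return lambda tokens: W[tokens[-1]]

def greedy_next(model, tokens):
    """One greedy decoding step: pick the argmax token under `model`."""
    return int(np.argmax(model(tokens)))

def speculative_decode(target, draft, prompt, n_new, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = greedy_next(draft, ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposals. Every token appended
        #    below is exactly the token the target alone would have chosen.
        base, all_matched = list(out), True
        for i in range(k):
            want = greedy_next(target, base + proposal[:i])
            out.append(want)
            if want != proposal[i]:    # first mismatch: keep the target's
                all_matched = False    # token and discard the rest
                break
        if all_matched:
            # All k accepted: the same verification pass yields a bonus token.
            out.append(greedy_next(target, base + proposal))
    return out[: len(prompt) + n_new]

target, draft = make_model(0), make_model(1)  # draft only partially agrees
prompt = [1, 2, 3]

plain = list(prompt)
for _ in range(12):
    plain.append(greedy_next(target, plain))

assert speculative_decode(target, draft, prompt, 12) == plain  # identical output
```

With temperature sampling, implementations use rejection sampling instead of exact matching, which preserves the output distribution rather than the exact token sequence.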
What do you mean by draft model? And how would one disable it? Cheers
A draft model is something that you would explicitly enable. It uses a smaller model to speculatively generate next tokens, in theory speeding up generation.
Here are the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
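To make "in theory speeding up generation" concrete: under the simple independence model from the original speculative decoding paper (Leviathan et al., 2023), if each drafted token is accepted with probability alpha and the draft proposes k tokens per round, the expected number of tokens emitted per expensive target-model pass is (1 - alpha^(k+1)) / (1 - alpha). A quick sketch:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass, assuming
    each drafted token is accepted independently with probability `alpha`
    and the draft proposes `k` tokens per round (Leviathan et al., 2023)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha, 4), 2))
# 0.5 1.94
# 0.7 2.77
# 0.9 4.1
```

At alpha = 0.9 and k = 4 that's about 4.1 tokens per target pass, which is why a draft model that tracks the target well helps a lot, while a poorly matched one (low alpha) barely helps at all.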