Comment by jasonjmcghee
10 months ago
Importantly, they note that using a draft model screws it up, and this was my experience: I was initially impressed, then started seeing problems, but after disabling my draft model it started working much better. Very cool stuff; it's fast too, as you note.
The /think and /no_think commands are very convenient.
That should not be the case. Speculative decoding trades extra compute for memory bandwidth: the output is guaranteed to be the same with or without it (exactly the same tokens under greedy decoding, the same distribution under sampling). Perhaps there's a bug in the implementation you're using.
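For intuition, here's a minimal sketch of the greedy case in Python, with toy stand-in models. The helper names are hypothetical, and a real implementation verifies all drafted positions in a single batched forward pass of the target model; this loop scores them one at a time for clarity. The point is that every emitted token is one the target model itself chose, so the draft model can only change speed, never output:

```python
import numpy as np

VOCAB = 16

def make_model(seed: int):
    """Toy 'language model': next-token logits depend only on the last token."""
    W = np.random.default_rng(seed).standard_normal((VOCAB, VOCAB))
    return lambda tokens: W[tokens[-1]]

def greedy_next(model, tokens):
    """One greedy decoding step: pick the argmax token under `model`."""
    return int(np.argmax(model(tokens)))

def speculative_decode(target, draft, prompt, n_new, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = greedy_next(draft, ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposals. Every token appended
        #    below is exactly the token the target alone would have chosen.
        base, all_matched = list(out), True
        for i in range(k):
            want = greedy_next(target, base + proposal[:i])
            out.append(want)
            if want != proposal[i]:    # first mismatch: keep the target's
                all_matched = False    # token and discard the rest
                break
        if all_matched:
            # All k accepted: the same verification pass yields a bonus token.
            out.append(greedy_next(target, base + proposal))
    return out[: len(prompt) + n_new]

target, draft = make_model(0), make_model(1)  # draft only partially agrees
prompt = [1, 2, 3]

plain = list(prompt)
for _ in range(12):
    plain.append(greedy_next(target, plain))

assert speculative_decode(target, draft, prompt, 12) == plain  # identical output
```

With temperature sampling, implementations use rejection sampling instead of exact matching, which preserves the output distribution rather than the exact token sequence.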
What do you mean by draft model? And how would one disable it? Cheers
A draft model is something that you would explicitly enable. It uses a smaller model to speculatively generate next tokens, in theory speeding up generation.
Here are the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
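To make "in theory speeding up generation" concrete: under the simple independence model from the original speculative decoding paper (Leviathan et al., 2023), if each drafted token is accepted with probability alpha and the draft proposes k tokens per round, the expected number of tokens emitted per expensive target-model pass is (1 - alpha^(k+1)) / (1 - alpha). A quick sketch:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass, assuming
    each drafted token is accepted independently with probability `alpha`
    and the draft proposes `k` tokens per round (Leviathan et al., 2023)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha, 4), 2))
# 0.5 1.94
# 0.7 2.77
# 0.9 4.1
```

At alpha = 0.9 and k = 4 that's about 4.1 tokens per target pass, which is why a draft model that tracks the target well helps a lot, while a poorly matched one (low alpha) barely helps at all.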