← Back to context

Comment by jasonjmcghee

10 months ago

Importantly they note that using a draft model screws it up and this was my experience. I was initially impressed, then started seeing problems, but after disabling my draft model it started working much better. Very cool stuff- it's fast too as you note.

The /think and /no_think commands are very convenient.

That should not be the case. Speculative decoding is trading off compute for memory bandwidth. The model's output is guaranteed to be the same, with or without it. Perhaps there's a bug in the implementation that you're using.