Comment by roadside_picnic

7 days ago

I'm sure there are countless tricks, but one that can be implemented at home, and that I know plays a major part in Cerebras' performance, is: speculative decoding.

Speculative decoding uses a smaller draft model to generate tokens with much less compute and memory required. Then the main model accepts or rejects those tokens based on the probability it would have generated them itself. In practice this can easily result in a 3x speedup in inference.
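A minimal sketch of the accept/reject loop, assuming a hypothetical model interface (`sample`, `prob`); real implementations verify all draft tokens in a single batched forward pass, which is where the speedup comes from:

```python
import random

def speculative_decode(target, draft, prompt, n_draft=4):
    """One round of speculative decoding (sketch, after Leviathan et al. 2023).

    `target` and `draft` are hypothetical model objects assumed to expose:
      - sample(ctx) -> (token, prob)   # sample one token and its probability
      - prob(ctx, token) -> float      # probability of `token` given ctx
    """
    ctx = list(prompt)

    # 1. The draft model cheaply proposes n_draft tokens.
    proposed = []
    for _ in range(n_draft):
        tok, p_draft = draft.sample(ctx + [t for t, _ in proposed])
        proposed.append((tok, p_draft))

    # 2. The target model scores the proposals. Accepting each token with
    #    probability min(1, p_target / p_draft) keeps the output distribution
    #    identical to sampling the target model alone.
    accepted = []
    for tok, p_draft in proposed:
        p_target = target.prob(ctx + accepted, tok)
        if random.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)
        else:
            break  # first rejection ends the round

    # 3. The target model always contributes one token of its own, so even a
    #    fully rejected round makes progress. (The full algorithm resamples
    #    the rejected position from the normalized residual
    #    max(0, p_target - p_draft); sampling the target directly here is a
    #    simplification.)
    tok, _ = target.sample(ctx + accepted)
    accepted.append(tok)
    return accepted
```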

Another trick for structured outputs that I know of is "fast forwarding", where you can skip tokens if you know they are the only acceptable outputs. For example, when generating JSON you know the output must start with `{ "<first key>": ` etc. This can also lead to a ~3x speedup when responding in JSON.
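A sketch of the fast-forward idea, assuming a hypothetical constraint object that can report which tokens the output grammar permits next (`allowed_tokens`, `done`) and a model with a masked `sample` call:

```python
def generate_constrained(model, constraint, ctx):
    """Constrained generation with fast-forwarding (sketch).

    `constraint.allowed_tokens(seq)` is assumed to return the set of token
    ids the output grammar permits at the next position, and
    `constraint.done(seq)` to signal a complete output.
    """
    out = []
    while not constraint.done(ctx + out):
        allowed = constraint.allowed_tokens(ctx + out)
        if len(allowed) == 1:
            # Fast-forward: only one legal continuation (e.g. the `{ "`
            # boilerplate of a JSON object), so skip the model call
            # entirely and append the forced token.
            out.append(next(iter(allowed)))
        else:
            # Otherwise pay for a forward pass, masked to legal tokens.
            out.append(model.sample(ctx + out, mask=allowed))
    return out
```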

gpt-oss-120b can be used with gpt-oss-20b as a speculative draft model in LM Studio.

I'm not sure it improved the speed much.

  • To measure the performance gains on a local machine (or even a standard cloud GPU setup), since you can't run this in parallel with the same efficiency you could in a high-end data center, you need to compare the number of calls made to each model.

    In my experience I've seen the calls to the target model reduced to a third of what they would have been without using a draft model.

    You'll still get some gains locally, but they won't be near the theoretical maximum you'd see if everything were properly tuned for performance.

    It also depends on the type of task. I was working with pretty structured data with lots of easy-to-predict tokens.

  • It depends a lot on the type of conversation. A lot of ChatGPT load appears to be therapy talk that even small models can correctly predict.

  • A 6:1 parameter ratio is too small for specdec to have that much of an effect. You'd really want to see 10:1 or even more for this to start to matter.

    • You're right on ratios, but the effective ratio is actually much worse than 6:1 since they are MoEs. The 20B has 3.6B active parameters, and the 120B has only 5.1B active, about 40% more!
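For concreteness, a back-of-the-envelope check using the standard speculative-decoding speedup formula from Leviathan et al. 2023, approximating per-token cost by active parameter count (a rough assumption) and picking an illustrative 80% acceptance rate:

```python
def expected_speedup(alpha, gamma, c):
    """Expected speedup from speculative decoding (Leviathan et al. 2023).

    alpha: per-token acceptance rate
    gamma: draft tokens proposed per round
    c:     draft cost / target cost per token
    Tokens per round: (1 - alpha**(gamma + 1)) / (1 - alpha)
    Cost per round:   gamma * c + 1 target-equivalent forward passes
    """
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_round = gamma * c + 1
    return tokens_per_round / cost_per_round

# gpt-oss: 3.6B active (draft) vs 5.1B active (target) -> c ~= 0.71
print(expected_speedup(alpha=0.8, gamma=4, c=3.6 / 5.1))  # ~0.88x: a net slowdown
# A 10:1-or-better cost ratio (c = 0.1) makes the same acceptance rate pay off:
print(expected_speedup(alpha=0.8, gamma=4, c=0.1))        # ~2.4x
```

Which matches the observation above: with the draft model costing ~70% of the target per token, even a high acceptance rate can't recoup the drafting overhead.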