Comment by roadside_picnic
7 days ago
To measure the performance gains on a local machine (or even standard cloud GPU setup), since you can't run this in parallel with the same efficiency you could in a high-ed data center, you need to compare the number of calls made to each model.
In my experiences I'd seen the calls to the target model reduced to a third of what they would have been without using a draft model.
You'll still get some gains on a local model, but they won't be near what they could be theoretically if everything is properly tuned for performance.
It also depends on the type of task. I was working with pretty structured data with lots of easy to predict tokens.
No comments yet
Contribute on Hacker News ↗