Comment by NitpickLawyer
3 hours ago
Your experience might be a bit dated, depending on when you last tried it. MTP (which is a flavor of spec decoding) is showing really solid improvements on local models, even on consumer hardware.
In fact, as the article mentions, you get the biggest gains at low concurrency (so local use should benefit), with diminishing returns at higher concurrency (if you think in terms of units of compute, it's probably better to serve more requests in parallel and get more throughput that way).
Eagle3 was great at low context though, and this seems to improve things at high context. That's really cool, and hopefully it'll turn out to be useful at those lengths. Eagle3 is also training dependent, so you could try training your own if your use cases diverge enough that third-party "generalist" models don't suit your needs. (In general NVIDIA, Red Hat, etc. have provided general Eagle3 models for popular families.)
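To put rough numbers on the low-concurrency point, here is a back-of-the-envelope sketch using the standard speculative decoding estimate. Everything in it is an assumption for illustration: the function names, the simplification that each drafted token is accepted independently with the same probability, and especially `verify_cost`, a made-up knob standing in for how much more expensive a verification pass gets once the GPU is compute-bound at high concurrency. It is not taken from the article or from any particular serving stack.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that a pass always yields at least one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(
    accept_rate: float,
    draft_len: int,
    draft_cost: float,
    verify_cost: float = 1.0,
) -> float:
    """Rough speedup over plain decoding.

    draft_cost:  cost of one draft step relative to one normal decode step.
    verify_cost: cost of one verification pass relative to one normal decode
                 step (hypothetical knob: about 1.0 when memory-bound, larger
                 when the batch is already compute-bound).
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    return tokens / (verify_cost + draft_len * draft_cost)


if __name__ == "__main__":
    # Illustrative inputs only, not measurements.
    for verify_cost in (1.0, 1.5, 2.5):
        s = estimated_speedup(0.8, 4, 0.05, verify_cost)
        print(f"verify_cost={verify_cost:.1f} -> estimated speedup {s:.2f}x")
```

With `verify_cost` near 1 (single user, memory-bound) the extra drafted tokens are close to free, which is where the gains come from. As `verify_cost` grows the same acceptance rate buys less, and past some point the estimate drops below 1x, which matches the intuition that at high concurrency you'd rather spend that compute on more parallel requests.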