
Comment by zozbot234

8 hours ago

Speculative decoding is not that useful at scale; it's mostly about making local, single-user inference faster. When you're batching multiple inferences together, the batch already keeps the hardware as busy as the verification pass you'd have to perform with speculative decoding, so there's little idle compute left for the draft tokens to exploit.
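Roughly, the draft-then-verify loop looks like this. This is a toy sketch with dummy stand-in "models" (simple token rules, not neural nets) just to show the mechanics: a cheap draft model proposes k tokens, the target model verifies them in what would be one batched forward pass on real hardware, and you keep the accepted prefix.

```python
def draft_model(prefix, k):
    # Dummy cheap model: guesses next token = last token + 1 (mod 10).
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_model(prefix):
    # Dummy "ground truth": same rule, except after a 5 it emits 0,
    # so the draft model's guesses sometimes diverge.
    last = prefix[-1]
    return 0 if last == 5 else (last + 1) % 10

def speculative_decode(prefix, steps, k=4):
    tokens = list(prefix)
    while len(tokens) < len(prefix) + steps:
        proposal = draft_model(tokens, k)
        # Verify all k proposals against the target model. On real
        # hardware these k checks run as one batched forward pass --
        # nearly free for a single user, but not when the batch is
        # already full of other users' requests.
        accepted, ctx = 0, list(tokens)
        for tok in proposal:
            if tok != target_model(ctx):
                break
            ctx.append(tok)
            accepted += 1
        tokens.extend(proposal[:accepted])
        if accepted < k:
            # Draft diverged: fall back to the target model's token,
            # so every iteration still makes progress.
            tokens.append(target_model(tokens))
    return tokens[:len(prefix) + steps]
```

When the draft model is right, you get up to k tokens for one verification pass; when it's wrong, you still get one token, so it never decodes slower than the target model alone (in tokens per pass).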

The future will have LLMs running locally on your laptop and devices — if not almost exclusively, then at least for 90-95% of tasks. Speculative decoding is just one of many existing techniques, with more to come, that will make this even more viable. The gap is closing on both fronts: software gets faster and more clever, hardware gets faster and smaller. The single-user story is the story. I'm obviously speculating myself, but that's how I see it.