Comment by eru
3 hours ago
The model can make a reasonably good guess at what you are trying to say and do 'speculative' thinking, just like CPUs use branch prediction.
In the common case, you say what the model predicted, so it can use its speculative thinking. In the rare case where you deviate from the prediction, the model thinks from scratch.
(You can further cut down on latency by speculatively thinking about the top two predictions instead of just the top one. It just costs you more parallel compute.)
This is also very similar to a chess player who ponders her next move during your turn.
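The scheme above can be sketched in a few lines. This is a toy illustration, not a real inference stack: `predict_next_inputs` and `think` are hypothetical stand-ins for the predictor and the model's expensive reasoning step, and the thread pool plays the role of the parallel compute spent on each speculative branch.

```python
import concurrent.futures

def predict_next_inputs(history, k=2):
    # Hypothetical predictor: returns the k most likely next user inputs.
    # Stubbed with fixed guesses for illustration.
    return ["yes", "no"][:k]

def think(user_input):
    # Stand-in for the model's expensive reasoning step.
    return f"response to {user_input!r}"

class SpeculativeAssistant:
    def __init__(self, k=2):
        self.k = k
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=k)
        self.speculative = {}  # predicted input -> in-flight Future

    def on_turn_end(self, history):
        # While the user is still typing, start thinking about the top-k
        # predicted inputs in parallel -- the branch-prediction step.
        for guess in predict_next_inputs(history, self.k):
            self.speculative[guess] = self.pool.submit(think, guess)

    def on_user_input(self, user_input):
        future = self.speculative.pop(user_input, None)
        self.speculative.clear()          # discard mispredicted branches
        if future is not None:
            return future.result()        # prediction hit: reuse speculative work
        return think(user_input)          # misprediction: think from scratch

assistant = SpeculativeAssistant(k=2)
assistant.on_turn_end(history=[])
print(assistant.on_user_input("yes"))    # hit: answer was precomputed
print(assistant.on_user_input("maybe"))  # miss: computed from scratch
```

Either way the caller gets the same answer; a prediction hit just means most of the thinking already happened while the user was typing.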