Comment by fulafel
5 hours ago
AIUI you run the checks of several predicted tokens in lockstep, and the computation for each token is served by the same data loaded from memory. In normal execution, each token would depend on the previous one, precluding the parallelization and causing much more per-token memory traffic.
So this is a case of spending idle compute capacity that would otherwise sit waiting on the bottleneck (memory access).
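To make the trade-off concrete, here's a minimal toy sketch (my own illustration, not from the comment): a single weight matrix stands in for the model's parameters, and each forward pass is counted as one streaming read of those weights from memory. Checking k already-drafted tokens in one batched pass touches the weights once, where sequential decoding would touch them k times.

```python
import numpy as np

# Hypothetical toy "model": one weight matrix standing in for the LLM's
# parameters. Counting one weight read per forward pass models the memory
# traffic that dominates token-by-token decoding.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))

weight_reads = 0

def forward(x):
    """One forward pass; x may be a single token vector or a batch."""
    global weight_reads
    weight_reads += 1          # W is streamed from memory once per call
    return x @ W.T

k = 4
tokens = rng.standard_normal((k, d))  # k drafted tokens, already known

# Sequential execution: one pass per token -> k weight reads.
weight_reads = 0
seq_out = np.stack([forward(t) for t in tokens])
seq_reads = weight_reads

# Lockstep verification: check all k drafted tokens in one batched pass,
# reusing the weights loaded once.
weight_reads = 0
par_out = forward(tokens)
par_reads = weight_reads

assert np.allclose(seq_out, par_out)  # same results either way
print(seq_reads, par_reads)           # → 4 1
```

The batched pass only works because the drafted tokens are already known; in normal decoding each input depends on the previous output, which is exactly the serialization the comment describes.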