Comment by sva_
1 month ago
I don't think its mathematically equivalent or even close because the context/logprobs will be very different, since you only produce 1 token per pass. I'd say the token itself has a lot less information than the signal propagating through the residual stream of transformer blocks.
No comments yet
Contribute on Hacker News ↗