Comment by joaogui1
3 years ago
The attention matrix is computed based on all tokens in the context, so it kind of functions non-parametrically (but over the batch instead of over the whole training dataset)
3 years ago
The attention matrix is computed based on all tokens in the context, so it kind of functions non-parametrically (but over the batch instead of over the whole training dataset)
No comments yet
Contribute on Hacker News ↗