Comment by joaogui1

3 years ago

The attention matrix is computed from all the tokens in the context, so it kind of functions non-parametrically (but over the tokens in the current context instead of over the whole training dataset).
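
To make that concrete, here is a rough sketch of single-query scaled dot-product attention in plain Python. The names (`attend`, `softmax`) and the toy numbers are my own assumptions for illustration, and raw vectors stand in for the learned key/value projections a real transformer would apply first. The point is only that the output is a weighted average over whatever tokens happen to be in the context.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over a context.

    Hypothetical helper: keys/values are used as given, with no learned
    projection matrices, to isolate the data-dependent part.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Toy context of three tokens (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
query = [1.0, 0.0]

# Same query, same function, different context.
out_full, w_full = attend(query, keys, values)
out_short, w_short = attend(query, keys[:2], values[:2])
```

Nothing in `attend` was retrained between the two calls, yet `out_full` and `out_short` differ because the set of context tokens differs. That is the sense in which it behaves a bit like a kernel smoother whose "dataset" is the context.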
