Comment by agoose77
2 months ago
I am not an expert by _any_ means, but to provide _some_ intuition — self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...), i.e. each output vector is the corresponding input vector transformed by some function of all the other input vectors.
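To make the "parameterised token mixer" picture concrete, here is a minimal single-head sketch in NumPy. The sizes, random projection matrices, and variable names are made up purely for illustration; it is not anyone's actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 8                            # toy sequence length and embedding dim
    X = rng.normal(size=(T, d))            # input token vectors, one row per token

    # "Learned" projections (random here, just to show the shape of the computation)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # The mixing matrix A is T x T and is computed from the input itself
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)     # row-wise softmax

    out = A @ V                            # each output row is a weighted mix of all rows of V

The point is that `out = A @ V` is just a matrix mixing the token vectors together; what makes it attention is that `A` is computed from the input rather than being fixed.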
You can see conceptually how this is similar to a convolution with some simplification, e.g. https://openreview.net/pdf?id=8l5GjEqGiRG
Convolutions are often used in contexts where you want each output to take the surrounding context into account (and, with large enough kernels or enough stacked layers, something approaching global state).
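For comparison, a 1-D convolution can be written as the same kind of token mixing, except the mixing matrix is fixed, shared across positions, and banded (local) rather than computed from the input. Again a toy sketch, reusing `X` and `T` from above with an arbitrary 3-tap kernel:

    # A 1-D convolution as a fixed, banded mixing matrix M
    kernel = np.array([0.25, 0.5, 0.25])   # arbitrary 3-tap kernel, shared across positions
    M = np.zeros((T, T))
    for i in range(T):
        for k, w in enumerate(kernel, start=-1):
            if 0 <= i + k < T:
                M[i, i + k] = w
    conv_out = M @ X                       # local, input-independent token mixing

Widening the kernel or stacking layers is how convolutions reach further context, which is roughly the sense in which they and attention end up playing a similar role.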