Comment by tel

2 years ago

I think that's untrue and unfair. I don't think anyone understands attention completely enough to reduce it to "just a kernel smoothing". For a good example, the Transformer Circuits team published research in 2022 showing a bit more detail about how attention heads work in toy models: https://transformer-circuits.pub/2022/in-context-learning-an...

I think the original intuition for attention came from noticing the long-term information decay that occurs in RNNs, and realizing that in seq-to-seq language translation models you often need to "attend" to different parts of the input stream in order to produce the next output token, since languages sometimes put function words in different orders. Transformer attention as we know it today was, iirc, one of a few competing models for handling this issue.

To that end, lots of kernel smoothers have been designed and tested, but attention came out of a line of research that aimed to give recurrent neural networks explicit degrees of freedom for using a larger "memory", by analogy to how computers have read and write capabilities on shared state.
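
To make the comparison concrete, here is a rough sketch of my own (not taken from any of the work mentioned above) of where the kernel-smoother reading fits and where it stops. The weighting step of a single dot-product attention head does have the shape of a kernel smoother: a normalized, weighted average of values. What the "just" leaves out is that the query, key and value projections are learned, so the network decides what counts as similar. All names here (`kernel_smooth`, `attention_head`, `W_q`, `W_k`, `W_v`) are made up for illustration, and the matrices are hand-picked rather than trained.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    # Multiply a matrix (list of rows) by a vector.
    return [dot(row, x) for row in M]

def kernel_smooth(query, keys, values):
    # Weighted average of values, with weights from an exponential
    # dot-product kernel between the query and each key.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def attention_head(x_query, xs, W_q, W_k, W_v):
    # Same smoother, but query, keys and values are projections of the
    # inputs. In a real model W_q, W_k, W_v are learned; here they are
    # hypothetical hand-picked matrices.
    q = matvec(W_q, x_query)
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    return kernel_smooth(q, keys, values)

if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    swap = [[0.0, 1.0], [1.0, 0.0]]
    # With identity projections the head is a plain kernel smoother.
    print(attention_head([1.0, 0.0], xs, identity, identity, identity))
    # Changing only the query projection changes which input gets attended to.
    print(attention_head([1.0, 0.0], xs, swap, identity, identity))
```

With identity projections the head puts most of its weight on the input that looks like the query. Swapping only `W_q` makes the same query attend to the other input. That freedom is the part a fixed kernel doesn't have, and it's before you even get to stacking heads and layers.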