Comment by programjames

2 years ago

I too "read Vaswani et al. (2017) multiple times, carefully, and was quite unable to grasp what 'attention' was supposed to be doing. (I could follow the math.) I also read multiple tutorials, for multiple intended audiences, and got nothing from them."

It took years before I finally realized it was just kernel smoothing (though I never used quite so precise language), all because of a poorly written paper. This is what I mean when I say almost every ML paper is trash. "Attention Is All You Need" is even way better than most. Have you ever read the Adam paper?

I think that's untrue and unfair. I don't think anyone quite knows what attention is so completely as to simplify it to "just a kernel smoothing". For a great example, the Transformer Circuits team have 2022 research showing a bit more detail about how attention heads work in toy models: https://transformer-circuits.pub/2022/in-context-learning-an...

I think the original intuition for attention came from noticing long-term information decay in RNNs and realizing that seq-to-seq translation models often need to "attend" to different parts of the input stream to produce the next output token, since languages sometimes put function words in different orders. Transformer attention as we know it today was, iirc, one of a few competing approaches to this problem.

To that end, lots of kernel smoothers have been designed and tested, but attention came out of a line of research aimed at giving recurrent neural networks explicit degrees of freedom to use a larger "memory", by analogy to how computers read and write shared state.

I always say that both this and the BERT paper are breakthrough contributions, but quite awful papers (when we talk about literally the papers, not the discoveries or the software). They're quite badly written and explained (and I don't think they're better than most, at least in NLP which is what I typically read) and they both feel like post hoc rationalizations for massive trial and error. This is common in papers coming from big industry labs, to be honest. I tend to find papers from academia better written, although I may be biased due to being an academic myself.

  • Masking is all you need would be a better description.

    • What is "masking" in a paper that also has a section dedicated to mask segmentation ("masking" as in creating segmentation masks)?

I would say the opposite. This paper was a very easy read, totally clear from the first reading what it is about, etc.

The background matters. Attention was already very well known in the machine translation community, so it was nothing new in this paper, which was written for an audience that already knew basic concepts like attention.

If you want to learn about attention, read some of the actual background papers which introduced it.

Possibly the authors did not have a mental model of why the model worked. Attention, keys and heads may have been post hoc rationalizations. The alchemy stage may be comical but necessary.

  • I think this misses important history. This was a machine translation paper, and we were already using seq2seq RNNs with attention at the time. They didn't coin the term attention, they just realized that you could use attention from a sequence to itself. Terminology and understanding are always super path-dependent.
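To make "attention from a sequence to itself" concrete, here is a minimal sketch in plain Python. It assumes the simplest possible setup: the queries, keys and values are all the raw input vectors, with no learned projection matrices, no multiple heads and no masking, so it is an illustration of the idea rather than the layer from the paper. The function names are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Scaled dot-product attention where the sequence attends to itself.

    Assumption: queries, keys and values are all the raw vectors in `xs`
    (identity projections), which a real Transformer layer would learn.
    """
    d = len(xs[0])
    out = []
    for q in xs:  # every position acts as a query...
        # ...and is scored against every position in the same sequence
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        # the output is a weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(xs)
```

Each output vector is a convex combination of the inputs, which is why the "weighted average" reading of attention holds regardless of how the scores are computed.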

> I get more understanding out of "it's a kind of kernel smoothing" than "it's as though an associative array were continuous", that doesn't mean everyone will, or even that many people will. (My educational trajectory was weird.) But the sheer opacity of this literature is I think a real problem. (Cf. Phuong and Hutter 2022.)

As a non-ML person but a programmer, the key, value, query concepts made more sense to me. But I admit I don’t fully get why it works other than "lots of neurons training on how every combo of tokens relate to each other."

> This is what I mean when I say almost every ML paper is trash

Papers don't use the term you are familiar with, so they're trash...?

  • No, they're poor at explaining. Have you read the Adam paper? The key concept is the signal to noise ratio, but it's only mentioned on the third page in a paragraph that nearly covers the screen.

  • In a scientific paper you either define everything or you give references to other papers that define them.

    Not even simple terms like natural numbers should be assumed. Some definitions include the number 0, some do not. But it does not matter as long as you provide a definition: let's talk about non-negative whole numbers, i.e. including 0.

    Definitions are paramount to quality science and research, otherwise very simple disagreements and misunderstandings derive from the lack of a basic set of knowledge.
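On the Adam point raised earlier in this subthread: the quantity the paper calls the signal-to-noise ratio is the bias-corrected first moment divided by the square root of the bias-corrected second moment, which is what scales each step. A rough scalar sketch, assuming the paper's default hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); the function name `adam_ratio` is made up for this example:

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return m_hat / (sqrt(v_hat) + eps) after feeding in scalar gradients.

    Adam's parameter update is the learning rate times this ratio.
    """
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    ratio = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        ratio = m_hat / (math.sqrt(v_hat) + eps)
    return ratio

# Consistent gradients: strong signal, ratio close to 1 (full-size steps)
steady = adam_ratio([2.0] * 100)
# Sign-flipping gradients of the same magnitude: weak signal, ratio near 0
noisy = adam_ratio([2.0 if i % 2 == 0 else -2.0 for i in range(100)])
```

When the gradients agree from step to step the ratio is near 1 and Adam takes steps of roughly the learning rate; when they mostly cancel, the ratio shrinks and so does the step.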

Did you mean kernel regression rather than kernel smoothing? I ask because: https://d2l.ai/chapter_attention-mechanisms-and-transformers...

Quoting from a previous section

> The attention mechanism allows us to aggregate data from many (key, value) pairs. So far our discussion was quite abstract, simply describing a way to pool data. We have not explained yet where those mysterious queries, keys, and values might arise from. Some intuition might help here: for instance, in a regression setting, the query might correspond to the location where the regression should be carried out. The keys are the locations where past data was observed and the values are the (regression) values themselves.
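The regression reading in that quote can be sketched in a few lines of plain Python. This is a Nadaraya-Watson estimator with a Gaussian kernel, written so that the kernel weights come out of a softmax over negative squared distances, which is the same "scores, then softmax, then weighted average of values" shape as attention. The bandwidth parameter and the toy data are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kernel_regression(query, keys, values, bandwidth=1.0):
    """Nadaraya-Watson estimate at `query` using a Gaussian kernel.

    query  : location where the regression is evaluated
    keys   : locations where data was observed
    values : observed values at those locations
    """
    # Gaussian kernel in log space; softmax normalizes the weights to sum to 1
    scores = [-((query - k) ** 2) / (2 * bandwidth ** 2) for k in keys]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

keys = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 1.0, 4.0, 9.0]  # toy data: y = x^2
estimate = kernel_regression(1.5, keys, values, bandwidth=0.5)
exact_hit = kernel_regression(2.0, keys, values, bandwidth=0.01)
```

With a small bandwidth the estimate collapses onto the value of the nearest key; with a large one it approaches the plain average of all values. Dot-product attention swaps the distance-based score for a learned similarity, but the pooling step is the same.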