Comment by svcrunch
2 years ago
While in Google Research, I worked with two of the authors of the "Attention is All you Need" paper, including the gentleman who chose that title.
As others have pointed out, self-attention was already a known concept in the research community. They don't claim to have invented that. Rather, the authors began by looking at how to improve the power of feed-forward neural networks using a combination of techniques, obtained some exciting results, and then, in the course of ablation studies, discovered that attention was really all you needed!
The title is a play on the Beatles song, "All You Need Is Love".
In terms of expository style, the paper that was most helpful for me was [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238) by Phuong and Hutter. Written for clarity and with an emphasis on precision, the motivation section (Section 2) of the paper does a great job of explaining deficiencies in the original paper and subsequent ones.
The paper you shared is interesting, and its paragraph justifying why pseudocode matters more than code in papers is pleasantly surprising and seems obvious in retrospect. Quote:
"Source code vs pseudocode. Providing open source code is very useful, but not a proper substitute for formal algorithms. There is a massive difference between a (partial) Python dump and well-crafted pseudocode. A lot of abstraction and clean-up is necessary: remove boiler plate code, use mostly single-letter variable names, replace code by math expressions wherever possible, e.g. replace loops by sums, remove (some) optimizations, etc. A well-crafted pseudocode is often less than a page and still essentially complete, compared to often thousands of lines of real source code."
The problem is that most pseudocode I see is not well crafted, and often seemingly no effort has gone into ensuring that it gives a complete or accurate picture.
Do you have insight into the choice of the term attention, which, according to this article’s author, bears very little resemblance to the human sense of the word (i.e., it is selective and not averaging)?
No.
But to your point, note that in 2020 neuroscientists introduced the Tolman-Eichenbaum Machine (TEM) [1], a mathematical model of the hippocampus that bears a striking resemblance to the transformer architecture.
Artem Kirsanov has a very nice piece on TEM, "Can we Build an Artificial Hippocampus?" [2] The link is directly to the spot where he makes the connection to transformers, although you should watch the whole video for context.
Because I wasn't clear on the chronology, I went back and asked one of the "Attention" authors whether mathematical models of the hippocampus inspired their paper. His answer was "no". If TEM was developed without prior knowledge of transformers, then it's a very deep result IMHO.
[1] https://www.sciencedirect.com/science/article/pii/S009286742...
[2] https://www.youtube.com/watch?v=cufOEzoVMVA&t=1254s
There’s a video[1] of Karpathy recounting an email correspondence he had with Bahdanau. The email explains that the word “Attention” comes from Bengio who, in one of his final reviews of the paper, determined it to be preferable to Bahdanau’s original idea of calling it “RNNSearch”.
[1] https://youtu.be/XfpMkf4rD6E?t=18m23s
"RNNSearch is all you need" probably wouldn't catch on and we'd still be ChatGPT-less.
Not OP and have no insight, but the thing that caused it to click for me was when I heard “this token attends to that token”. Basically, there’s a new value created that represents how much one thing (in an LLM, it's tokens) cares about another thing.
Saying “attends to” vs “attention” helped clarify (for me) the mechanics of what’s going on.
An attention layer transforms word vectors by adding information from the other words in the sequence. The amount of information added from each neighboring word is regulated by a weight called the "attention weight". If the attention weight for one of the neighbors is enormously large, then all the information added will be from that word. In contrast, if the attention weight for a neighbor is zero, that neighbor will add no information to the word. This is called an 'attention mechanism' since it literally decides which information to pass through the network, i.e. which other words the model should 'pay attention to' when it is considering a particular word.
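To make the weighting concrete, here is a minimal sketch in plain Python of scaled dot-product attention for a single query. It is an illustration under assumptions, not the paper's full layer: the names `softmax` and `attend` and the toy vectors are made up for this example, and a real transformer would add learned query/key/value projections, multiple heads, and a residual connection that adds the result back onto the original word vector.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (attention weights, weighted sum of values) for one query."""
    scale = math.sqrt(len(query))
    # One score per neighbor: how strongly the query "attends to" that key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Mix the value vectors in proportion to the attention weights.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical toy data: three neighbors with 2-dimensional vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [20.0, 0.0]  # lines up strongly with the first key

weights, output = attend(query, keys, values)
print(weights)  # the first weight dominates
print(output)   # so the output is almost exactly the first value vector
```

One caveat: with softmax the weights are never exactly zero, only very close to it, so "zero attention" in practice means a negligible contribution (or an explicit mask).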
Mm, attention as used in earlier papers makes a lot more sense with respect to the term... there were several where it was literally used to focus on some part of an image at a higher resolution, for example.
> the paper that was most helpful for me was [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238)
Interesting, but hard to read since it uses quite unique notation for matrix indexing and multiplication. Why???
Jakob and Ashish were great :)