Comment by svcrunch

2 years ago

While at Google Research, I worked with two of the authors of the "Attention Is All You Need" paper, including the gentleman who chose that title.

As others have pointed out, self-attention was already a known concept in the research community. They don't claim to have invented that. Rather, the authors began by looking at how to improve the power of feed-forward neural networks using a combination of techniques, obtained some exciting results, and then, in the course of ablation studies, discovered that attention was really all you needed!

The title is a play on the Beatles song, "All You Need Is Love".

In terms of expository style, the paper that was most helpful for me was [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238) by Phuong and Hutter. Written for clarity and with an emphasis on precision, the motivation section (Section 2) of the paper does a great job of explaining deficiencies in the original paper and subsequent ones.

antman 2 years ago

The paper you shared is interesting. Its justification for why pseudocode matters more than code in papers is surprising in a good way, and seems obvious in retrospect. Quote:

"Source code vs pseudocode. Providing open source code is very useful, but not a proper substitute for formal algorithms. There is a massive difference between a (partial) Python dump and well-crafted pseudocode. A lot of abstraction and clean-up is necessary: remove boiler plate code, use mostly single-letter variable names, replace code by math expressions wherever possible, e.g. replace loops by sums, remove (some) optimizations, etc. A well-crafted pseudocode is often less than a page and still essentially complete, compared to often thousands of lines of real source code."

vidarh 2 years ago

The problem is that most pseudocode I see is not well crafted, and often seemingly no effort has gone into ensuring that it gives a complete or accurate picture.

next_xibalba 2 years ago

Do you have insight into the choice of the term attention, which, according to this article’s author, bears very little resemblance to the human sense of the word (i.e. it is selective and not averaging)?

svcrunch 2 years ago

No.
But to your point, note that in 2020 neuroscientists introduced the Tolman-Eichenbaum Machine (TEM) [1], a mathematical model of the hippocampus that bears a striking resemblance to the transformer architecture.
Artem Kirsanov has a very nice piece on TEM, "Can we Build an Artificial Hippocampus?" [2] The link is directly to the spot where he makes the connection to transformers, although you should watch the whole video for context.
Because I wasn't clear on the chronology, I went back and asked one of the "Attention" authors whether mathematical models of the hippocampus inspired their paper. His answer was "no". If TEM was developed without prior knowledge of transformers, then it's a very deep result IMHO.
[1] https://www.sciencedirect.com/science/article/pii/S009286742...
[2] https://www.youtube.com/watch?v=cufOEzoVMVA&t=1254s
x1000 2 years ago
There’s a video [1] of Karpathy recounting an email correspondence he had with Bahdanau. The email explains that the word “Attention” comes from Bengio who, in one of his final reviews of the paper, determined it to be preferable to Bahdanau’s original idea of calling it “RNNSearch”.
[1] https://youtu.be/XfpMkf4rD6E?t=18m23s
- behnamoh 2 years ago
  
  "RNNSearch is all you need" probably wouldn't catch on and we'd still be ChatGPT-less.
  
Me1000 2 years ago

Not OP and have no insight, but the thing that caused it to click for me was when I heard “this token attends to that token”. Basically, there’s a new value created that represents how much one thing (in an LLM, it’s tokens) cares about another thing.
Saying “attends to” vs “attention” helped clarify (for me) the mechanics of what’s going on.
casualscience 2 years ago

An attention layer transforms word vectors by adding information from the other words in the sequence. The amount of information added from each neighboring word is regulated by a weight called the "attention weight". If the attention weight for one of the neighbors is enormously large relative to the others, then nearly all the information added will come from that word. In contrast, if the attention weight for a neighbor is zero, that neighbor will add no information to the word. This is called an 'attention mechanism' since it literally decides which information to pass through the network, i.e. which other words the model should 'pay attention to' when it is considering a particular word.
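
A minimal sketch of that description in plain Python, for anyone who prefers code. It assumes scaled dot-product scores and a softmax to produce the weights, neither of which the comment above specifies. The names `softmax` and `attend` and the toy vectors are hypothetical, and real layers also learn projection matrices for the queries, keys, and values, which are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, mixed) for one query token.

    weights[j] is how much the query token "attends to" token j;
    mixed is the weighted sum of the value vectors.
    """
    d = len(query)
    # Assumed scoring rule: dot product scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed

# Toy data for three tokens (made up for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# A query aligned with the first key: almost all weight goes to token 0.
weights, mixed = attend([8.0, 0.0], keys, values)

# A query with no preference: every token gets the same weight.
flat_weights, flat_mixed = attend([0.0, 0.0], keys, values)
```

With a softmax the weights are never exactly zero, only very close to it, so "adds no information" holds approximately. The "adding" in the comment corresponds to the residual connection, where `mixed` is added back onto the token's own vector.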
davidguetta 2 years ago

Mm, attention as used in earlier papers makes a lot more sense with respect to the term... there were several where it was literally used to focus on some part of an image at a higher resolution, for example.

_giorgio_ 2 years ago

> the paper that was most helpful for me was [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238)

Interesting, but hard to read since it uses quite unusual notation for matrix indexing and multiplication. Why???

mugivarra69 2 years ago

jakob and ashish were great :)