
Comment by albertzeyer

2 years ago

The audience of this paper is other researchers who already know the concept of attention, which was already very well known in the field. In such research papers, these things are never explained again, since all researchers already know them or can read the cited sources; the papers focus on the actual research question instead. In this case, the research question was simply: can we get away with just using attention and not using the LSTM anymore? Before that, everyone was using both together.

I think learning it by following the historical development can be helpful. E.g. in this case, learn the concept of attention, specifically cross-attention, first. That is this paper: Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", 2014, https://arxiv.org/abs/1409.0473

That paper introduces it. But even that paper is maybe quite dense, and to really grasp it, it helps to reimplement these things yourself.
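If you want to try that, the core of Bahdanau-style (additive) cross-attention is just a few lines. Here is a minimal sketch in NumPy; the weight names (`W_q`, `W_k`, `v`) and shapes are my own illustrative choices, not taken from the paper's or the authors' code:

```python
# Minimal sketch of additive (Bahdanau-style) cross-attention.
# One decoder query attends over a sequence of encoder states.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_cross_attention(query, keys, values, W_q, W_k, v):
    """
    query:  (d_q,)    current decoder state
    keys:   (T, d_k)  encoder hidden states h_1..h_T
    values: (T, d_v)  usually the same encoder states
    Returns the context vector and the attention weights.
    """
    # Score each encoder position: e_t = v^T tanh(W_q q + W_k h_t)
    scores = np.tanh(query @ W_q + keys @ W_k) @ v   # (T,)
    alpha = softmax(scores)                           # (T,) attention weights, sum to 1
    context = alpha @ values                          # (d_v,) weighted sum of values
    return context, alpha

# Toy usage with random data, just to check the shapes
rng = np.random.default_rng(0)
T, d_q, d_k, d_att = 5, 8, 8, 16
q = rng.normal(size=(d_q,))
H = rng.normal(size=(T, d_k))
W_q = rng.normal(size=(d_q, d_att))
W_k = rng.normal(size=(d_k, d_att))
v = rng.normal(size=(d_att,))
context, alpha = additive_cross_attention(q, H, H, W_q, W_k, v)
print(alpha.sum())  # ~1.0: a distribution over the T encoder positions
```

In the actual model the query would come from the decoder RNN state and this would run once per output step, but the weighted-sum idea above is the whole trick.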

It's always dense, because those papers have space constraints set by the conferences, max 9 pages or so. To get a more detailed overview, you can study the authors' code, or other resources. There is a lot out there now about these topics, whole books, etc.

What books cover this topic exclusively? Thanks