Comment by Atheb

2 days ago

> how in explanations of attention the Q, K, V matrices always seem to be pulled out of a hat after being motivated in a hand-wavy metaphorical way.

Justin Johnson's lecture on Attention [1] really helped me understand the concept of attention in transformers. In the lecture he goes through the history and iterations of attention mechanisms, from CNNs and RNNs to Transformers, keeping the notation coherent throughout, so you get to see how and when the QKV matrices appear in the literature. It's an hour long but it's IMO a must watch for anyone interested in the topic.
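
For anyone who wants the punchline in code rather than a metaphor, here's a minimal sketch of scaled dot-product attention in plain numpy (my own illustration, not taken from the lecture). The "pulled out of a hat" Q, K, V matrices are just three learned linear projections of the same input sequence:

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings; W_q/W_k/W_v: learned projections
    Q = X @ W_q                                     # queries: what each token is looking for
    K = X @ W_k                                     # keys: what each token can be matched on
    V = X @ W_v                                     # values: what each token contributes
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # each output is a weighted mix of values

# Toy usage: 4 tokens, embedding dim 8 (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)  # shape (4, 8)
```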

[1]: https://www.youtube.com/watch?v=YAgjfMR9R_M