Comment by p1esk
2 days ago
The way I think about the QKV projections: Q defines the sensitivity of token i's features when computing this token's similarity to all other tokens. K defines the visibility of token j's features when it is selected by all other tokens. V defines which features matter when taking the weighted sum over all tokens.
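Roughly, in code (a minimal sketch of plain scaled dot-product self-attention; all names and shapes are illustrative, not from any particular implementation):

    import torch

    d_model, d_head, n_tokens = 16, 8, 5
    X = torch.randn(n_tokens, d_model)   # one feature vector per token

    W_q = torch.randn(d_model, d_head)   # "sensitivity": how token i probes the others
    W_k = torch.randn(d_model, d_head)   # "visibility": how token j answers those probes
    W_v = torch.randn(d_model, d_head)   # "content": what token j contributes to the sum

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = (Q @ K.T) / d_head ** 0.5   # similarity of token i to token j
    A = torch.softmax(scores, dim=-1)    # each row sums to 1: mixing weights for one token
    out = A @ V                          # weighted sum of value vectors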
Don't get caught up in interpreting QKV; it's a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarities / multiplicative interactions, and may even work better [2]. EDIT: Oh, and attention is much broader than scaled dot-product attention [3].
[1] https://www.emergentmind.com/topics/merged-attention
[2] https://blog.google/innovation-and-ai/technology/developers-...
[3] https://arxiv.org/abs/2111.07624
I glanced at these links and it seems that all these attention variants still use QKV projections.
Do you see any issues with my interpretation of them?
Read the third link / review paper; it is not at all the case that all attention is based on QKV projections.
Your terms "sensitivity", "visibility", and "important" are too vague and lack any clear mathematical meaning, so IMO they add nothing to any understanding. "Important" also seems factually wrong: these layers are stacked, so later weights and operations can in fact inflate / reverse things. Deriving e.g. feature importances from self-attention layers remains a highly disputed area (e.g. [1] vs [2], for just the tip of the iceberg).
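To make the stacking point concrete, here is a toy two-token example (made-up numbers): a token can receive nearly all of the attention weight yet contribute almost nothing to the output, because its value vector is near zero (or gets zeroed / flipped by a later projection):

    import torch

    A = torch.tensor([[0.99, 0.01]])      # query token attends almost entirely to token 0
    V = torch.tensor([[0.0, 0.0],         # token 0: huge weight, zero content
                      [5.0, -3.0]])       # token 1: tiny weight, all the content
    print(A @ V)                          # tensor([[ 0.0500, -0.0300]]) -- token 1 dominates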
You are also assuming that what matters about attention is the highly specific QKV structure and projections, but the third review link I shared gives very little reason to believe that. Or, if you'd like another example of why not to focus so much on scaled dot-product attention, note that it is just a subset of a broader category of multiplicative interactions (https://openreview.net/pdf?id=rylnK6VtDH); see the sketch after the references below.
[1] Attention is not Explanation - https://arxiv.org/abs/1902.10186
[2] Attention is not not Explanation - https://arxiv.org/abs/1908.04626
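On that last point, a minimal sketch (my own, with illustrative names): the multiplicative-interactions paper considers the general form f(x, z) = z^T W x + z^T u + v^T x + b (scalar-output slice shown here), and a dot-product attention logit q·k is the special case where W is constrained to a low-rank product and the linear and bias terms are dropped:

    import torch

    d, d_head = 16, 4
    x, z = torch.randn(d), torch.randn(d)

    # General multiplicative interaction (scalar output): z^T W x + z^T u + v^T x + b
    W = torch.randn(d, d)                 # unconstrained bilinear form
    u, v, b = torch.randn(d), torch.randn(d), torch.randn(())
    general = z @ W @ x + z @ u + v @ x + b

    # Attention logit: W restricted to low rank (W_q^T W_k), no linear terms
    W_q, W_k = torch.randn(d_head, d), torch.randn(d_head, d)
    attn_logit = (W_q @ x) @ (W_k @ z)    # == x^T (W_q^T W_k) z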