
Comment by D-Machine

2 days ago

Don't get caught up in interpreting QKV; it is a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarities / multiplicative interactions, and may even work better [2] (see the sketch after the links below). EDIT: Oh, and attention is much broader than scaled dot-product attention [3].

[1] https://www.emergentmind.com/topics/merged-attention

[2] https://blog.google/innovation-and-ai/technology/developers-...

[3] https://arxiv.org/abs/2111.07624
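
To make "broader than scaled dot-product attention" concrete, here is a minimal numpy sketch (mine, not taken from any of the links above) that puts the standard QKV formulation next to a projection-free variant which still weights tokens by their mutual similarities. The projection-free form is only an illustration of the general multiplicative interaction, not any specific published method.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def sdpa(X, Wq, Wk, Wv):
        # Standard scaled dot-product attention with separate Q/K/V projections.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        return A @ V

    def projection_free_attention(X):
        # The same kind of token-token similarity weighting, but with no learned
        # Q/K/V projections at all: the raw token vectors are compared directly.
        A = softmax(X @ X.T / np.sqrt(X.shape[-1]))
        return A @ X

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                       # 5 tokens, d_model = 8
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(sdpa(X, Wq, Wk, Wv).shape)                  # (5, 8)
    print(projection_free_attention(X).shape)         # (5, 8)

Both compute a softmax over pairwise similarities and mix the tokens accordingly; only the first bakes that similarity into three learned projections.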

I glanced at these links and it seems that all these attention variants still use QKV projections.

Do you see any issues with my interpretation of them?

  • Read the third link / review paper; it is not at all the case that all attention is based on QKV projections.

    Your terms "sensitivity", "visibility", and "important" are too vague and lack any clear mathematical meaning, so IMO they add nothing to the understanding. "Important" also seems factually wrong: these layers are stacked, so later weights and operations can in fact inflate or reverse things. Deriving feature importances from self-attention layers, for example, remains a highly disputed area (see [1] vs [2] for just the tip of the iceberg).

    You are also assuming that what matters about attention is the highly specific QKV structure and projections, but the third review link I shared gives very little reason to believe that. Or, if you'd like another example of why not to focus so much on scaled dot-product attention, note that it is just a special case of a broader category of multiplicative interactions (https://openreview.net/pdf?id=rylnK6VtDH), as sketched after the references below.

    [1] Attention is not Explanation - https://arxiv.org/abs/1902.10186

    [2] Attention is not not Explanation - https://arxiv.org/abs/1908.04626
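
    To spell out "a special case of multiplicative interactions": with row-vector tokens, the QK logit (x_i Wq) · (x_j Wk) equals x_i (Wq Wk^T) x_j^T, i.e. a single bilinear form on the raw token vectors, with the Q/K split merely parameterizing that form as a low-rank product. A quick numpy check of the identity (my own sketch, arbitrary shapes):

        import numpy as np

        rng = np.random.default_rng(0)
        n_tokens, d_model, d_head = 5, 8, 4
        X = rng.normal(size=(n_tokens, d_model))      # rows are token vectors
        Wq = rng.normal(size=(d_model, d_head))
        Wk = rng.normal(size=(d_model, d_head))

        # Attention logits the usual way: project to Q and K, then dot-product.
        logits_qk = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_head)

        # The same logits as one bilinear (multiplicative) interaction between
        # the raw token vectors, with M = Wq Wk^T / sqrt(d_head).
        M = Wq @ Wk.T / np.sqrt(d_head)
        logits_bilinear = X @ M @ X.T

        print(np.allclose(logits_qk, logits_bilinear))   # True

    Nothing about the bilinear form requires the Wq/Wk factorization specifically; that is the sense in which the QKV packaging is one choice among many.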

      • 1. The two papers you linked are about the importance of attention weights, not of the QKV projections. This is orthogonal to our discussion.

      2. I don't see how the transformations done in one attention block can be reversed in the next block (or in the FFN immediately after the first block). Can you please explain?

      3. All state-of-the-art open-source LLMs (DeepSeek, Qwen, Kimi, etc.) still use all three QKV projections and largely the same original attention algorithm, with some efficiency tweaks (grouped-query attention, MLA, etc.) that are done strictly to make the models faster/lighter, not smarter.

      4. When GPT-2 came out, I myself tried removing various ops from attention blocks and evaluated the impact. Among other things I tried removing the individual projections (using the unmodified input vectors instead), and in all three cases I observed quality degradation when training from scratch (sketched after this list).

      5. The terms "sensitivity", "visibility", and "important" all attempt to describe feature importance during pattern matching. I use them in the same sense as the importance of features matched by convolutional-layer kernels, which scan the input image for patterns.
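
      For point 4, a rough numpy sketch of that kind of ablation (illustrative, not my original GPT-2 code): each projection is dropped in turn by replacing it with the identity matrix, i.e. feeding the unmodified input vectors; in the actual experiment each variant was trained from scratch and then compared on quality.

          import numpy as np

          def softmax(z, axis=-1):
              z = z - z.max(axis=axis, keepdims=True)
              e = np.exp(z)
              return e / e.sum(axis=axis, keepdims=True)

          def attention(X, Wq, Wk, Wv):
              Q, K, V = X @ Wq, X @ Wk, X @ Wv
              return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

          rng = np.random.default_rng(0)
          d = 8
          X = rng.normal(size=(5, d))                  # 5 tokens
          Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
          I = np.eye(d)    # identity = "use the unmodified input vectors"

          # Forward-pass variants only; training and evaluation are omitted here.
          variants = {
              "full": attention(X, Wq, Wk, Wv),
              "no_Q": attention(X, I, Wk, Wv),
              "no_K": attention(X, Wq, I, Wv),
              "no_V": attention(X, Wq, Wk, I),
          }
          for name, out in variants.items():
              print(name, out.shape)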
