
Comment by p1esk

2 days ago

1. The two papers you linked are about the importance of attention weights, not QKV projections. That is orthogonal to our discussion.

2. I don't see how the transformations done in one attention block can be reversed in the next block (or in the FFN immediately after the first block): can you please explain?

3. All state-of-the-art open-source LLMs (DeepSeek, Qwen, Kimi, etc.) still use all three QKV projections and largely the same original attention algorithm, with some efficiency tweaks (grouped-query attention, MLA, etc.) that are done strictly to make the models faster/lighter, not smarter.

4. When GPT-2 came out, I myself tried removing various ops from attention blocks and evaluated the impact. Among other things, I tried removing individual projections (using unmodified input vectors instead), and in all three cases I observed quality degradation (when training from scratch).

5. The terms "sensitivity", "visibility", and "importance" all attempt to describe feature importance when performing pattern matching. I use them in the same sense as the importance of features matched by convolutional kernels, which scan the input image and match patterns.

> and in all three cases I observed quality degradation (when training from scratch).

At the same model size and training FLOPs?

  • No. Each projection is ~5% of total FLOPs/params, not enough of a capacity change to matter. From what I remember, removing one of them was worse than the other two; I think it was Q. But in all three cases the degradation (in both loss and perplexity) was significant.

1. I do not think it is orthogonal, but regardless, there is plenty of research trying to extract explainability from every part of scaled dot-product attention layers (weights, QKV projections, activations, and so on), and trying to explain deep models generally via bottom-up mechanistic approaches. I think it can be argued that this has not given us much and is probably a waste of time (see e.g. https://ai-frontiers.org/articles/the-misguided-quest-for-me...). I think this is especially clear when you have evidence (in research, at least) that other mechanisms and layers can produce highly similar results.

2. I didn't say the transformations can be reversed; I said that if you interpret anything as an importance (e.g. a magnitude), it can be inflated or reversed by whatever weights the later layers learn. Negative values and/or weights make this even more complicated.

3. Not sure how this is relevant, but yes, any reasons for caring about the specifics of QKV and scaled dot-product attention are mostly about performance and/or the current popular leading models. There is nothing fundamentally important about scaled dot-product attention; it most likely just happens to be something that was settled on prematurely because it works quite well and is easy to parallelize. Or, if you like the kernel smoothing explanation also mentioned in this thread, scaled dot-product self-attention implements something very similar to a particularly simple and nice form of kernel smoothing (see the sketch after this list).

4. Yup, removing ops from scaled dot-product attention blocks is going to dramatically reduce expressivity, because there really aren't many ops there to remove. But there is enough work on low-rank, linear, and sparse attention showing that you can remove a lot of expressivity and still do quite well (the sketch after this list includes a toy linear-attention variant). And, of course, the many other helpful attention variants I linked before give gains in some cases too. You should be skeptical of any really simple or clean story about what is going on here. In particular, there is no clear reason why a small hypernetwork couldn't be used to approximate something more general than scaled dot-product attention, except that it would obviously be more expensive, and in practice you can probably get roughly the same flexibility by stacking simpler attention layers.

5. I still find that this doesn't give me any clear mathematical meaning.
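To make points 3 and 4 a bit more concrete, here is a minimal NumPy sketch (the function names and the feature map are my own toy choices, not anything from a particular paper): it writes scaled dot-product attention as Nadaraya-Watson kernel smoothing with an exponential kernel, then swaps in a cheaper feature map to get a crude linear-attention variant.

  import numpy as np

  def softmax(s, axis=-1):
      s = s - s.max(axis=axis, keepdims=True)
      e = np.exp(s)
      return e / e.sum(axis=axis, keepdims=True)

  rng = np.random.default_rng(0)
  n, d = 6, 4
  Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))

  # Scaled dot-product attention.
  sdpa_out = softmax(Q @ K.T / np.sqrt(d)) @ V

  # The same computation written as Nadaraya-Watson kernel smoothing: each output
  # is a kernel-weighted average of the values, with kernel(q, k) = exp(q.k / sqrt(d)).
  def kernel_smooth(q, K, V, kernel):
      w = np.array([kernel(q, k) for k in K])
      return (w[:, None] * V).sum(axis=0) / w.sum()

  ks_out = np.stack([kernel_smooth(q, K, V, lambda a, b: np.exp(a @ b / np.sqrt(d)))
                     for q in Q])
  assert np.allclose(sdpa_out, ks_out)

  # A toy (non-causal) linear-attention variant: replace the exponential kernel with
  # a cheap positive feature map phi; less expressive, but O(n) in sequence length.
  phi = lambda x: np.maximum(x, 0.0) + 1e-3
  lin_out = (phi(Q) @ (phi(K).T @ V)) / (phi(Q) @ phi(K).sum(axis=0)[:, None])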

I suspect our learning goals are at odds. If you want to focus solely on the very specific kind of attention used in the popular transformer models today, perhaps because you are interested in optimizations or distillation or something, then by all means try to come up with special intuitions about Q, K, and V, if you think that will help here. But those intuitions will likely not translate well to future and existing modifications and improvements to attention layers, in transformers or otherwise. You will be better served learning about attention broadly and developing intuitions based on that.

Others have mentioned the kernel smoothing interpretation, and I think multiplicative interactions are the clearer, deeper generalization of what is really important and valuable here. Also, the useful intuitions in DL have been less about e.g. "feature importances" and "sensitivity" and such, and more from linear algebra and calculus: matrix conditioning, regularization/smoothing, Lipschitz constants, and the like. In particular, the softmax in self-attention is probably not doing what people typically say it does (https://arxiv.org/html/2410.18613v1), and the real point is that all these attention layers are trained end-to-end, with all layers interdependent on each other to varying, complicated degrees. Focusing on very specific interpretations ("Q is this, K is that"), especially where those interpretations are vaguely metaphorical, like yours, is not likely to result in much deep understanding, in my opinion.
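By multiplicative interactions I mean roughly the following (a toy NumPy sketch, with sizes and names chosen only for illustration): in an ordinary layer the weight matrix is fixed, while in a bilinear/multiplicative layer one input effectively chooses the weights applied to the other; self-attention is a constrained special case, since the mixing weights softmax(QK^T) are themselves built from products of the inputs.

  import numpy as np

  rng = np.random.default_rng(0)
  d = 8
  x, z = rng.normal(size=d), rng.normal(size=d)

  # Ordinary (additive) layer: a fixed weight matrix, the inputs are only summed.
  W = rng.normal(size=(d, d))
  additive = W @ (x + z)

  # Multiplicative interaction (bilinear form): z selects the effective weights
  # applied to x, W_eff = sum_k z_k * U[k], so the mapping itself is input-dependent.
  U = rng.normal(size=(d, d, d))                   # one (d, d) slice per coordinate of z
  W_eff = np.tensordot(z, U, axes=([0], [0]))
  multiplicative = W_eff @ x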

  • Per your point 4, some current hyped work is pushing hard in this direction [1, 2, 3]. The basic idea is to think of attention as a way of implementing an associative memory. Variants like SDPA or gated linear attention can then be derived as methods for optimizing this memory online such that a particular query will return a particular value. Different attention variants correspond to different ways of defining how the memory produces a value in response to a query, and how we measure how well the produced value matches the desired value.

    Some of the attention-like ops proposed in this new work are most simply described as implementing the associative memory with a hypernetwork that maps keys to values, with weights that are optimized at test time to minimize value-retrieval error (a toy version of this recurrence is sketched after the reference list). Like you suggest, designing these hypernetworks to permit efficient implementations is tricky.

    It's a more constrained interpretation of attention than you're advocating for, since it follows the "attention as associative memory" perspective, but the general idea of test-time optimization could be applied to other mechanisms for letting information interact non-linearly across arbitrary nodes in the compute graph.

    [1] https://arxiv.org/abs/2501.00663

    [2] https://arxiv.org/abs/2504.13173

    [3] https://arxiv.org/abs/2505.23735
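    A minimal toy version of that recurrence, under the simplest possible assumptions (a linear fast-weight memory, plain SGD on the squared retrieval error, no gating, decay, or chunking; the function name and learning rate are just illustrative):

      import numpy as np

      def memory_attention(Q, K, V, lr=0.1):
          """Toy fast-weight associative memory, written and read at test time."""
          M = np.zeros((V.shape[1], K.shape[1]))    # memory starts empty
          outputs = []
          for q, k, v in zip(Q, K, V):
              err = M @ k - v                       # retrieval error for this (k, v) pair
              M -= lr * np.outer(err, k)            # one SGD step on 0.5 * ||M k - v||^2
              outputs.append(M @ q)                 # read the updated memory with the query
          return np.stack(outputs)

      rng = np.random.default_rng(0)
      Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
      out = memory_attention(Q, K, V)               # shape (8, 4), one read per step

    The papers above use richer memory modules and more careful, parallelizable update rules, but the basic shape is the same: write by taking an optimization step on the retrieval error, then read with the query.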

  • > perhaps because you are interested in optimizations or distillation or something

    Yes, my job is model compression: quantization, pruning, factorization, ops fusion/approximation/caching, in the context of hw/sw codesign.

    In general, I agree with you that simple intuitions often break down in DL; I've observed it many times. I also agree that we don't have a good understanding of how these systems work. Hopefully this situation is more like pre-Newtonian physics, and the Newtons are coming.