Comment by libraryofbabel

2 months ago

This is ok (could use some diagrams!), but I don't think anyone coming to this for the first time will be able to use it to really teach themselves the LLM attention mechanism. It's a hard topic and requires two or three book chapters at least if you really want to start grokking it!

For anyone serious about coming to grips with this stuff, I would strongly recommend Sebastian Raschka's excellent book Build a Large Language Model (From Scratch), which I just finished reading. It's approachable and also detailed.

As an aside, does anyone else find the whole "database lookup" motivation for QKV kind of confusing? (in the article, "Query (Q): What am I looking for? Key (K): What do I contain? Value (V): What information do I actually hold?"). I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix. After all, the actual contents and "meaning" of QKV are very opaque: the weights that are used to construct them are learned during training. Furthermore, there is a lot of symmetry between Q and K in the algebra, which gets broken only by the causal mask. Or do people find this motivation useful and meaningful in some deeper way? What am I missing?

[edit: on this last question, the article on "Attention is just Kernel Smoothing" that roadside_picnic posted below looks really interesting in terms of giving a clean generalized mathematical approach to this, and also affirms that I'm not completely off the mark by being a bit suspicious about the whole hand-wavy "database lookup" Queries/Keys/Values interpretation]

30 comments

libraryofbabel

D-Machine 2 months ago

> I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix.

That's because what you say here is the correct understanding. The lookup thing is nonsense.

The terms "Query" and "Value" are largely arbitrary and meaningless in practice, look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x) or self_attention(x, x, y) in some cases (e.g. cross-attention), where x and y are are outputs from previous layers.

Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations/similarities and/or multiplicative interactions among a dimension-reduced representation. EDIT: Or, as you say, it can be regarded as kernel smoothing.

libraryofbabel 2 months ago
Thanks! Good to know I’m not missing something here. And yeah, it’s always just seemed to me better to frame it as: let’s find a mathematical structure to relate every embedding vector in a sequence to every other vector, and let’s throw in a bunch of linear projections so that there are lots of parameters to learn during training to make the relationship structure model things from language, concepts, code, whatever.
I’ll have to read up on merged attention, I haven’t got that far yet!
- D-Machine 2 months ago
  
  The main takeaway is that "attention" is a much broader concept generally, so worrying too much about the "scaled dot-product attention" of transformers deeply limits your understanding of what kinds of things really matter in general.
  A paper I found particularly useful on this was generalizing even farther to note the importance of multiplicative interactions more generally in deep learning (https://openreview.net/pdf?id=rylnK6VtDH).
  EDIT: Also, this paper I was looking for dramatically generalizes the notion of attention in a way I found to be quite helpful: https://arxiv.org/pdf/2111.07624

ianand 2 months ago

I'm not a fan of the database lookup analogy either.

The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in (latent) space. The attention mechanism is like a kind of "gravity" where each token is influencing each other, pushing and pulling each other around in (latent) space to refine their meaning. But instead of "distance" and "mass", this gravity is proportional to semantic inter-relatedness and instead of physical space this is occurring in a latent space.

https://www.youtube.com/watch?v=ZuiJjkbX0Og&t=3569s

skinner_ 2 months ago

Then I think you’ll like our project which aims to find the missing link between transformers and swarm simulations:
https://github.com/danielvarga/transformer-as-swarm
Basically a boid simulation where a swarm of birds can collectively solve MNIST. The goal is not some new SOTA architecture, it is to find the right trade-off where the system already exhibits complex emergent behavior while the swarming rules are still simple.
It is currently abandoned due to a serious lack of free time (*), but I would consider collaborating with anyone willing to put in some effort.
(*) In my defense, I’m not slacking meanwhile: https://arxiv.org/abs/2510.26543 https://arxiv.org/abs/2510.16522 https://www.youtube.com/watch?v=U5p3VEOWza8
art_mach 2 months ago

This is an excellent analogy! Thank you!

p1esk 2 months ago

The way I think about QKV projections: Q defines sensitivity of token i features when computing similarity of this token to all other tokens. K defines visibility of token j features when it’s selected by all other tokens. V defines what features are important when doing weighted sum of all tokens.

D-Machine 2 months ago
Don't get caught up in interpreting QKV, it is a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarities / multiplicative interactions, but may even work better [2]. EDIT: Oh and attention is much more broad than scaled dot-product attention [3].
[1] https://www.emergentmind.com/topics/merged-attention
[2] https://blog.google/innovation-and-ai/technology/developers-...
[3] https://arxiv.org/abs/2111.07624
- p1esk 2 months ago
  
  I glanced at these links and it seems that all these attention variants still use QKV projections.
  Do you see any issues with my interpretation of them?
  
  7 replies →

mnicky 2 months ago

IIRC isn't the symmetry between Q and K also broken by the direction of the softmax? I mean, row vs column-wise application yields different interpretation.

ebonnafoux 2 months ago
Yes but in practice, if you compute K=X.wk, Q=X.wq and then K.tQ you make three matrice multiplication. Wouldn't be faster to compute W=wk.twq beforhand and then just X.W.tX which will be just two matrices multiplication ? Is there something I am missing ?
- yorwba 2 months ago
  
  Most models have a per-head dimension much smaller than the input dimension, so it's faster to multiply by the small wk and wk individually than to multiply by the large matrix W. Also, if you use rotary positional embeddings, the RoPE matrices need to be sandwiched in the middle and they're different for every token, so you could no longer premultiply just once.
libraryofbabel 2 months ago
Oh yes! That's probably more important, in fact.
- mnicky 2 months ago
  
  Well, I think that this is also answer to your question about the intuition.
  If the assymetry of K and Q stems from the direction of the softmax application, it must also be the reason for the names of the matrices :)
  And if you think about it, it makes sense that for each Key, weights to all of the Queries sum to 1 and not vice versa.
  So this is my only intuition for the K and Q names.
  (It may or may not be similar to the whole "db lookup thing"... I just don't use that one.)

andoando 2 months ago

I find it really confusing as well. The analogy implies we have something like Q[K] = V

For one, I have no idea how this relates to the mathematical operations of calculating attention score, applying softmax and than doing dot product with the V matrix.

Second just conceptually I don't understand how this relates to the "a word looks up to how relevant it is to another word". So if you have "The cat eats his soup", "his" queries how it's important it is to cat. So is V just numerical result of the significance, like 0.99?

I dont think Im very stupid but after seeing a dozens of these, I am starting to wonder if anyone actually understands this conceptually

empiricus 2 months ago

Not sure how helpful it is, but: Words or concepts are represented as high-dim vectors. At high level, we could say each dimension is another concept like "dog"-ness or "complexity" or "color"-ness. The "a word looks up to how relevant it is to another word" is basically just relevance=distance=vector dot product. and the dot product can be distorted="some directions are more important" for one purpose or another(q/k/v matrixes distort the dot product). softmax is just a form of normalization (all sums to 1 = proper probability). The whole shebang works only because all pieces can be learned by gradient descent, otherwise it would be impossible to implement.

ebbi 2 months ago

Does that book require some sort of technical prerequisite to understand?

libraryofbabel 2 months ago
It helps if you have some basic linear algebra, for sure - matrices, vectors, etc. That's probably the most important thing. You don't need to know pytorch, which is introduced in the book as needed and in an appendix. If you want to really understand the chapters on pre-training and fine-tuning you'll need to know a bit of machine learning (like a basic grasp of loss functions and gradient descent and backpropagation - it's sort of explained in the book but I don't think I'd have understood it much without having trained basic neural networks before), but that is not required so much for the earlier chapters on the architecture, e.g. how the attention mechanism works with Q, K, V as discussed in this article.
The best part about it is seeing the code built up for the GPT-2 architecture in basic pytorch, and then loading in the real GPT-2 weights and they actually work! So it's great for learning but also quite realistic. It's LLM architecture from a few years ago (to keep it approachable), but Sebastian has some great more advanced material on modern LLM architectures (which aren't that different) on his website and in the github repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
- kouteiheika 2 months ago
  
  > modern LLM architectures (which aren't that different) on his website and in the github repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
  This might be underselling it a little bit. The difference between GPT2 and Qwen3 is maybe, I don't know, ~20 lines of code difference if you write it well? The biggest difference is probably RoPE (which can be tricky to wrap your head around); the rest is pretty minor.
  
  1 reply →
- ebbi 2 months ago
  
  Thank you! Might get the book to see what I can learn from it, and see what gaps I have to research and learn more. Appreciate the detailed response.
  
  1 reply →