I will beat loudly on the "Attention is a reinvention of Kernel Smoothing" drum until it is common knowledge. It looks like Cosma Shalizi's fantastic website is down for now, so here's an archive link to his essential reading on this topic [0].
If you're interested in machine learning at all and not very strong on kernel methods, I highly recommend taking a deep dive. Such a huge amount of ML can be framed through the lens of kernel methods (and things like Gaussian Processes will become much easier to understand).
0. https://web.archive.org/web/20250820184917/http://bactra.org...
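For anyone who wants to see the parallel rather than take it on faith, here is a minimal sketch (the function names, kernel choice and bandwidth are just illustrative): a Nadaraya–Watson kernel smoother predicts with normalized similarity weights times observed values, which is exactly the shape of an attention layer (softmax-normalized scores times values).

    import numpy as np

    def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
        # Gaussian-kernel similarity between the query point and every training point
        d2 = (x_train - x_query) ** 2
        logits = -d2 / (2 * bandwidth ** 2)
        # normalize the similarities -- the same role softmax plays over attention scores
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        # prediction = weighted average of the "values"
        return weights @ y_train

    # toy usage: smooth a noisy sine curve
    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 2 * np.pi, 50)
    y_train = np.sin(x_train) + 0.1 * rng.normal(size=50)
    print(nadaraya_watson(np.pi / 2, x_train, y_train, bandwidth=0.3))  # roughly 1.0

Attention swaps the fixed Gaussian kernel for a learned, asymmetric similarity (the scaled QK dot product), but the weighted-average skeleton is the same.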
This is really useful, thanks. In my other (top-level) comment, I mentioned some vague dissatisfactions around how in explanations of attention the Q, K, V matrices always seem to be pulled out of a hat after being motivated in a hand-wavy metaphorical way. The kernel methods treatment looks much more mathematically general and clean - although for that reason maybe less approachable without a math background. But as a recovering applied mathematician ultimately I much prefer a "here is a general form, now let's make some clear assumptions to make it specific" to a "here's some random matrices you have to combine in a particular way by murky analogy to human attention and databases."
I'll make a note to read up on kernels some more. Do you have any other reading recommendations for doing that?
> how in explanations of attention the Q, K, V matrices always seem to be pulled out of a hat after being motivated in a hand-wavy metaphorical way.
Justin Johnson's lecture on attention mechanisms [1] really helped me understand the concept of attention in transformers. In the lecture he goes through the history and iterations of attention mechanisms, from CNNs and RNNs to Transformers, while keeping the notation coherent, and you get to see how and when in the literature the QKV matrices appear. It's an hour long, but IMO it's a must-watch for anyone interested in the topic.
[1]: https://www.youtube.com/watch?v=YAgjfMR9R_M
https://arxiv.org/abs/2008.02217
They derive Q, K, V as a continuous analog of a Hopfield network.
That's kind of how applied ML is most of the time.
The neat chain of "this is how the math of it works" is constructed after the fact, once you've dialed something in and proven that it works. If ever.
> Such a huge amount of ML can be framed through the lens of kernel methods
And none of them are a reinvention of kernel methods. There is such a huge gap between the Nadaraya–Watson idea and a working attention model that calling it a reinvention is quite a reach.
One might as well say that neural networks trained with gradient descent are a reinvention of numerical methods for function approximation.
> One might as well say that neural networks trained with gradient descent are a reinvention of numerical methods for function approximation.
I don't know anyone who would disagree with that statement, and this is the standard framing I've encountered in nearly all neural network literature and courses. If you read any of the classic gradient-based papers, they fundamentally assume this position. Just take a quick read of "A Theoretical Framework for Back-Propagation" (LeCun, 1988) [0]; here's a quote from the abstract:
> We present a mathematical framework for studying back-propagation based on the Lagrangian formalism. In this framework, inspired by optimal control theory, back-propagation is formulated as an optimization problem with nonlinear constraints.
There's no way you can read that and not recognize that you're reading a paper on numerical methods for function approximation.
The issue is that Vaswani et al. never mention this relationship.
0. http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf
Site is still fine (but is and was always http-only):
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
In physics we call these things "duality": depending on the problem, one can choose different perspectives on the subject.
Things proven for one domain can then be pulled back to the other domain along the arrows of the duality connections.
The archive link above is broken: this is an earlier archived copy of that page with content intact:
https://web.archive.org/web/20230713101725/http://bactra.org...
This might be the single best blog post I've ever read, both in terms of content and style.
Y'all should read this, and make sure you read to the end. The last paragraph is priceless.
I don't understand what motivates the need for w1 and w2, unless we accept the premise that we are doing attention in the query and key spaces... which is not the author's thesis. What am I missing?
Surprisingly, reading this piece helped me better understand the query, key metaphor.
Oh wow, I wish I could give more than one upvote for this reference!
It's utterly baffling to me that there hasn't been more SOTA machine learning research on Gaussian processes with the kernels inferred via deep learning. It seems a lot more flexible than the primitive, rigid dot product attention that has come to dominate every aspect of modern AI.
I think this mostly comes down to (multi-headed) scaled dot-product attention just being very easy to parallelize on GPUs. You can then make up for the (relative) lack of expressivity / flexibility by just stacking layers.
Doesn't involve Gaussians, but:
The Free Transformer: https://arxiv.org/abs/2510.17558
Abstract: We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
In addition to what others have said, computational complexity is a big reason. Gaussian processes and kernelized SVMs have fit complexities of O(n^2) to O(n^3) (where n is the number of samples, using exact solutions rather than approximations), while neural nets and tree ensembles are O(n).
I think datasets with lots of samples tend to be very common (such as training on huge text datasets like LLMs do). In my travels most datasets for projects tend to be on the larger side (10k+ samples).
I think they tried it already in the original transformer paper. The results were not worth implementing.
From the paper (where additive attention is the other "similarity function"):
Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
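To make the quoted contrast concrete, here is a rough sketch of the two compatibility (similarity) functions for a single query/key pair; the matrix names and shapes below are illustrative, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    q, k = rng.normal(size=d), rng.normal(size=d)

    # scaled dot-product compatibility: one multiply-accumulate, batches into big matmuls
    score_dot = q @ k / np.sqrt(d)

    # additive (Bahdanau-style) compatibility: a one-hidden-layer feed-forward net
    W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    v = rng.normal(size=d)
    score_add = v @ np.tanh(W1 @ q + W2 @ k)

    print(score_dot, score_add)

Both produce one scalar score per query/key pair; the dot-product version just maps directly onto a single matrix multiplication when computed for all pairs at once.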
The Q, K, V matrices form neural networks at runtime; that's the entire point.
Yes, this needs to be linked more, you are doing a great service.
(How) do you find that framing enlightening?
Hey, can I contact you somehow?
This is ok (could use some diagrams!), but I don't think anyone coming to this for the first time will be able to use it to really teach themselves the LLM attention mechanism. It's a hard topic and requires two or three book chapters at least if you really want to start grokking it!
For anyone serious about coming to grips with this stuff, I would strongly recommend Sebastian Raschka's excellent book Build a Large Language Model (From Scratch), which I just finished reading. It's approachable and also detailed.
As an aside, does anyone else find the whole "database lookup" motivation for QKV kind of confusing? (in the article, "Query (Q): What am I looking for? Key (K): What do I contain? Value (V): What information do I actually hold?"). I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix. After all, the actual contents and "meaning" of QKV are very opaque: the weights that are used to construct them are learned during training. Furthermore, there is a lot of symmetry between Q and K in the algebra, which gets broken only by the causal mask. Or do people find this motivation useful and meaningful in some deeper way? What am I missing?
[edit: on this last question, the article on "Attention is just Kernel Smoothing" that roadside_picnic posted below looks really interesting in terms of giving a clean generalized mathematical approach to this, and also affirms that I'm not completely off the mark by being a bit suspicious about the whole hand-wavy "database lookup" Queries/Keys/Values interpretation]
> I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix.
That's because what you say here is the correct understanding. The lookup thing is nonsense.
The terms "Query" and "Value" are largely arbitrary and meaningless in practice, look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x) or self_attention(x, x, y) in some cases (e.g. cross-attention), where x and y are are outputs from previous layers.
Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations/similarities and/or multiplicative interactions among a dimension-reduced representation. EDIT: Or, as you say, it can be regarded as kernel smoothing.
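For concreteness, here is a minimal single-head self-attention sketch in plain NumPy that matches this reading: the "queries", "keys" and "values" are nothing but three learned projections of the same input, followed by a softmax-weighted sum. Shapes and names are illustrative only.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, w_q, w_k, w_v):
        # x: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head)
        q = x @ w_q                              # just a learned projection of x
        k = x @ w_k                              # another one
        v = x @ w_v                              # and another
        scores = q @ k.T / np.sqrt(k.shape[-1])  # n x n similarity matrix
        weights = softmax(scores, axis=-1)       # each row sums to 1
        return weights @ v                       # weighted average of value projections

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))                  # 5 token embeddings of dimension 8
    w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 4)

Nothing in the code knows about "looking things up"; it is projections, a similarity matrix, a row-wise normalization and a weighted sum.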
Thanks! Good to know I’m not missing something here. And yeah, it’s always just seemed to me better to frame it as: let’s find a mathematical structure to relate every embedding vector in a sequence to every other vector, and let’s throw in a bunch of linear projections so that there are lots of parameters to learn during training to make the relationship structure model things from language, concepts, code, whatever.
I’ll have to read up on merged attention, I haven’t got that far yet!
I'm not a fan of the database lookup analogy either.
The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in (latent) space. The attention mechanism is like a kind of "gravity" in which the tokens influence one another, pushing and pulling each other around in (latent) space to refine their meaning. But instead of depending on "distance" and "mass", this gravity is proportional to semantic inter-relatedness, and instead of physical space it operates in a latent space.
https://www.youtube.com/watch?v=ZuiJjkbX0Og&t=3569s
Then I think you’ll like our project which aims to find the missing link between transformers and swarm simulations:
https://github.com/danielvarga/transformer-as-swarm
Basically a boid simulation where a swarm of birds can collectively solve MNIST. The goal is not some new SOTA architecture, it is to find the right trade-off where the system already exhibits complex emergent behavior while the swarming rules are still simple.
It is currently abandoned due to a serious lack of free time (*), but I would consider collaborating with anyone willing to put in some effort.
(*) In my defense, I’m not slacking meanwhile: https://arxiv.org/abs/2510.26543 https://arxiv.org/abs/2510.16522 https://www.youtube.com/watch?v=U5p3VEOWza8
This is an excellent analogy! Thank you!
The way I think about QKV projections: Q defines sensitivity of token i features when computing similarity of this token to all other tokens. K defines visibility of token j features when it’s selected by all other tokens. V defines what features are important when doing weighted sum of all tokens.
Don't get caught up in interpreting QKV, it is a waste of time, since completely different attention formulations (e.g. merged attention [1]) still give you the similarities / multiplicative interactions, but may even work better [2]. EDIT: Oh and attention is much more broad than scaled dot-product attention [3].
[1] https://www.emergentmind.com/topics/merged-attention
[2] https://blog.google/innovation-and-ai/technology/developers-...
[3] https://arxiv.org/abs/2111.07624
IIRC isn't the symmetry between Q and K also broken by the direction of the softmax? I mean, row vs column-wise application yields different interpretation.
Yes, but in practice, if you compute K = X.wk, Q = X.wq and then K.tQ, you do three matrix multiplications. Wouldn't it be faster to compute W = wk.twq beforehand and then just X.W.tX, which is only two matrix multiplications? Is there something I am missing?
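The algebra in the question does check out; here is a quick numerical sanity check with toy shapes and illustrative names:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_model, d_head = 6, 8, 4
    X = rng.normal(size=(n, d_model))
    wk = rng.normal(size=(d_model, d_head))
    wq = rng.normal(size=(d_model, d_head))

    # three multiplications: K = X wk, Q = X wq, then K Q^T
    scores_a = (X @ wk) @ (X @ wq).T

    # two multiplications after precomputing W = wk wq^T (a d_model x d_model matrix)
    W = wk @ wq.T
    scores_b = X @ W @ X.T

    print(np.allclose(scores_a, scores_b))  # True: the two forms are identical

One common answer to the "why not merge them" question is a flop count: with d_head much smaller than d_model, the factored form costs about n^2 * d_head for the score matrix, while the merged form costs n^2 * d_model, and the per-token K projections can also be cached during generation.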
Oh yes! That's probably more important, in fact.
I find it really confusing as well. The analogy implies we have something like Q[K] = V
For one, I have no idea how this relates to the mathematical operations of calculating attention scores, applying softmax and then multiplying with the V matrix.
Second, just conceptually I don't understand how this relates to "a word looks up how relevant it is to another word". So if you have "The cat eats his soup", "his" queries how important it is to "cat". So is V just the numerical result of the significance, like 0.99?
I don't think I'm very stupid, but after seeing dozens of these, I am starting to wonder if anyone actually understands this conceptually.
Not sure how helpful it is, but: words or concepts are represented as high-dimensional vectors. At a high level, we could say each dimension is another concept, like "dog"-ness or "complexity" or "color"-ness. The "a word looks up how relevant it is to another word" part is basically just relevance = distance = vector dot product, and the dot product can be distorted ("some directions are more important") for one purpose or another; the Q/K/V matrices are what distort the dot product. Softmax is just a form of normalization (everything sums to 1, a proper probability). The whole shebang works only because all the pieces can be learned by gradient descent; otherwise it would be impossible to implement.
Does that book require some sort of technical prerequisite to understand?
It helps if you have some basic linear algebra, for sure - matrices, vectors, etc. That's probably the most important thing. You don't need to know pytorch, which is introduced in the book as needed and in an appendix. If you want to really understand the chapters on pre-training and fine-tuning you'll need to know a bit of machine learning (like a basic grasp of loss functions and gradient descent and backpropagation - it's sort of explained in the book but I don't think I'd have understood it much without having trained basic neural networks before), but that is not required so much for the earlier chapters on the architecture, e.g. how the attention mechanism works with Q, K, V as discussed in this article.
The best part about it is seeing the code built up for the GPT-2 architecture in basic pytorch, and then loading in the real GPT-2 weights and they actually work! So it's great for learning but also quite realistic. It's LLM architecture from a few years ago (to keep it approachable), but Sebastian has some great more advanced material on modern LLM architectures (which aren't that different) on his website and in the github repo: e.g. he has a whole article on implementing the Qwen3 architecture from scratch.
One of the big problems with Attention Mechanisms is that the Query needs to look over every single key, which for long contexts becomes very expensive.
A little side project I've been working on is to train a model that sits on top of the LLM, looks at each key and determines whether it's needed after a certain lifespan, and evicts it if possible (after the lifespan is expired). Still working on it, but my first pass test has a reduction of 90% of the keys!
https://github.com/enjeyw/smartkv
Is this not similar to DeepSeek's lightning indexer?
QKV attention is just a probabilistic lookup table where QKV allow adjusting dimensions of input/output to fit into your NN block. If your Q perfectly matches some known K (from training) then you get the exact V otherwise you get some linear combination of all Vs weighted by the attention.
It's not, please read the thread above.
These metaphorical database analogies bug me, and from what it seems like, a lot of other people in comments! So far some of the most reasonable explanations I have found that take training dynamics into account are from Lenka Zdeborova's lab (albeit in toy, linear attention settings but it's easy to see why they generalize to practical ones). For instance, this is a lovely paper: https://arxiv.org/abs/2509.24914
I published a video that explains Self-Attention and Multi-head attention in a different way -- going from intuition, to math, to code starting from the end-result and walking backward to the actual method.
Hopefully this sheds light on this important topic in a way that is different than other approaches and provides the clarity needed to understand Transformer architecture. It starts at 41:22 in the below video.
https://youtu.be/6jyL6NB3_LI?t=2482
The confusing thing about attention in this article (and the famous "Attention is all you need" paper it's derived from) is the heavy focus on self-attention. In self-attention, Q/K/V are all derived from the same input tokens, so it's confusing to distinguish their respective purposes.
I find attention much easier to understand in the original attention paper [0], which focuses on cross-attention for machine translation. In translation, the input sentence to be translated is tokenized into vectors {x_1...x_n}. The translated sentence is autoregressively generated into tokens {y_1...y_m}. To generate y_j, the model computes a similarity score of the previously generated token y_{j-1} against every x_i via the dot product s_{i,j} = x_i*K*y_{j-1}, transformed by the Key matrix. These are then softmaxed to create a weight vector a_j = softmax_i(s_{i,j}). The weighted average of X = [x_1|...|x_n] is taken with respect to a_j and transformed by the Value matrix, i.e. c_j = V*X*a_j. c_j is then passed to additional network layers to generate the output token y_j.
tl;dr: given the previous output token, compute its similarity to each input token (via K). Use those similarity scores to compute a weighted average across all input tokens, and use that weighted average to generate the next output token (via V).
Note that in this paper, the Query matrix is not explicitly used. It can be thought of as a token preprocessor: rather than computing s_{i,j} = x_i*K*y_{j-1}, each x_i is first linearly transformed by some matrix Q. Because this paper used an RNN (specifically, an LSTM) to encode the tokens, such transformations on the input tokens are implicit in each LSTM module.
[0] https://arxiv.org/pdf/1508.04025 (predates "Attention is all you need" by 3 years)
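A minimal sketch of one decoding step as described above, in plain NumPy and following the parent comment's notation (shapes are illustrative; in the actual paper the token representations come out of an LSTM encoder/decoder):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def cross_attention_step(X, y_prev, K, V):
        # X: (d, n) input token vectors x_1..x_n as columns
        # y_prev: (d,) representation of the previously generated token y_{j-1}
        # K, V: (d, d) learned "key" and "value" transforms
        s = X.T @ (K @ y_prev)   # s_i = x_i^T K y_{j-1}, one score per input token
        a = softmax(s)           # attention weights over the input sentence
        c = V @ (X @ a)          # c_j = V X a_j, the context vector passed onward
        return c, a

    rng = np.random.default_rng(0)
    d, n = 4, 7
    X = rng.normal(size=(d, n))
    y_prev = rng.normal(size=d)
    K, V = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    c, a = cross_attention_step(X, y_prev, K, V)
    print(a.sum(), c.shape)      # weights sum to 1; c is a single d-dimensional vector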
Very much this, cross attention and the x, y notation makes the similarity / covariance matrix far more clear and intuitive.
Also forget the terms "query", "key" and "value", or vague analogies to key-value stores, that is IMO a largely false analogy, and certainly not a helpful way to understand what is happening.
100% agreed. Attention finally clicked for me when I realized "wait, it's just a transformed, weighted dot product and has nothing to do with key/value lookups." I would have gotten this a lot faster had they called the key matrix \Sigma.
Isn't Bahdanau attention even earlier [0]?
[0] https://arxiv.org/abs/1409.0473
I think of it more from an information retrieval (i.e. search) perspective.
Imagine the input text as though it were the whole internet and each page is just 1 token. Your job is to build a neural-network Google results page for that mini internet of tokens.
In traditional search, we are given a search query, and we want to find web pages via an intermediate search results page with 10 blue links. Basically, when we're Googling something, we want to know "What web pages are relevant to this given search query?", and then given those links we ask "what do those web pages actually say?" and click on the links to answer our question. In this case, the "Query" is obviously the user search query, the "Key" is one of the ten blue links (usually the title of the page), and the "Value" is the content of the web page that link goes to.
In the attention mechanism, we are given a token and we want to find its meaning when contextualized with other tokens. Basically, we are first trying to answer the question "which other tokens are relevant to this token?", and then given the answer to that we ask "what is the meaning of the original token given these other relevant tokens?" The "Query" is a given token in the input text, the "Key" is another token in the input text, and the "Value" is the final meaning of the original token with that other token in context (in the form of an embedding). For a given token, you can imagine it is as though the attention mechanism "clicked the 10 blue links" of the other most relevant tokens in the input and combined them in some way to figure out the meaning of the original query token (and also you might imagine we ran such a query in parallel for every token in the input text at the same time).
So the self attention mechanism is basically google search but instead of a user query, it's a token in the input, instead of a blue link, it's another token, and instead of a web page, it's meaning.
Read through my comments and those of others in this thread: the way you are thinking here is metaphorical and so disconnected from the actual math as to be unhelpful. It is not the case that you can gain a meaningful understanding of deep networks by metaphor. You actually need to learn some very basic linear algebra.
Heck, attention layers never even see tokens. Even the first self-attention layer sees positional embeddings, but all subsequent attention layers are just seeing complicated embeddings that are a mish-mash of the previous layers' embeddings.
Thanks for the post and the explanation.
I really enjoyed this relevant article about prompt caching where the author explained some of the same principles and used some additional visuals, though the main point there was why KV cache hits makes your LLM API usage much cheaper: https://ngrok.com/blog/prompt-caching/
Nice, I tried to write up a simpler explanation for LLMs a few days back too @ https://kaamvaam.com/machine-learning-ai/llm-attention-expla... One thing that stumped me for a bit is the need for the V matrix.
"When we read a sentence like “The cat sat on the mat because it was comfortable,” our brain automatically knows that “it” refers to “the mat” and not “the cat.” "
Am I the only one who thinks it's not obvious the "it" refers to the mat? The cat could be sitting on the mat because the cat is comfortable
You are correct. This is pronoun ambiguity. I also immediately noticed it and was displeased to see it as the opener of the article. As in, I no longer expected correctness of anything else the author would write (I wouldn't normally be so harsh, but this is about text processing. Being correct about simple linguistic cases is critical)
For anyone interested, the textbook example would be:
> "The trophy would not fit in the suitcase because it was too big."
"it" may refer to either the suitcase or the trophy. It is reasonable here to assume "it" refers to the trophy being too large, as that makes the sentence logically valid. But change the sentence to
> "The trophy would not fit in the suitcase because it was too small."
Why would the cat being comfortable make it sit on a mat?
Many sentences require you to have some knowledge of the world to process. In this case, you need to have the knowledge that "being comfortable dictates where you sit" doesn't happen nearly as often as "where you sit dictates your comfort."
Even for humans NLP is probabilistic, which is why we still often get it wrong. Or at least I know that I do.
Ah, but cats won't just comfortably sit on a mat if they feel there is danger. They will only sit on a mat if they feel comfortable! Absent larger context, the sentence is in fact ambiguous (though I agree your reading is the most natural and obvious one).
I think "it" refers to the process of sitting on the mat.
I have a totally different interpretation and I'm not sharing, folks.
The LLM smell is now an oxford comma