Comment by eurekin
2 years ago
Those are such great questions. I'm also trying to find out, and my current notes are as follows.
Attention matrix, from the LLM viz at bbycroft.net:
So the main goal of self-attention is that each column wants to find relevant
information from other columns and extract their values, and does so by
comparing its query vector to the keys of those other columns. With the added
restriction that it can only look in the past.
It seems that an attention head (matrix) is a set of responses (one for each token) to the question: looking only at past tokens, which are most relevant when considering this one?
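To make that reading concrete, here is a minimal NumPy sketch of a single causal attention matrix. The sizes and the random queries and keys are made up for illustration and are not taken from the visualization; the point is only that row i ends up as a set of weights over tokens 0..i (the token itself included) that sums to 1.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Return a (T, T) matrix; row i is token i's weighting over tokens 0..i."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                         # compare every query with every key
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)            # hide positions in the future
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                              # masked entries become exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 8                    # 4 tokens, 8-dimensional queries/keys (hypothetical sizes)
q = rng.normal(size=(T, d))    # stand-in query vectors, one row per token
k = rng.normal(size=(T, d))    # stand-in key vectors, one row per token

weights = causal_attention_weights(q, k)
print(np.round(weights, 2))    # lower-triangular; each row sums to 1
```

The first token can only look at itself, so its row puts all of its weight on the diagonal entry.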
Huge speculation: it might be that this finds, let's say, first-order importance relations, i.e. the most important meaning. Adding a second head might allow finding n-th order importance: more subtle or nuanced considerations.
Or it has simply been found, in the course of ablation, that more heads are better than one :)
I asked ChatGPT about that:
Q: Seems that a attention head (matrix) is a set of responses (for each token)
to the question: looking only at past tokens, which are most relevant, when
considering this one?
A: Yes, your interpretation is a good way to understand the role of an attention
head in the context of models like GPT (Generative Pre-trained Transformer),
which use a causal or masked self-attention mechanism. Each attention head
effectively answers the question: "Given the current token, which of the
preceding tokens (including itself) are most relevant?" Here's a breakdown
of this process:
[...]
Multiple Attention Heads: It's important to note that modern transformer models
use multiple attention heads in parallel for each token. Each head can potentially
focus on different aspects or patterns in the sequence, allowing the model to
capture a richer and more nuanced understanding of the context.
Which seems to support that "n-th order of importance" interpretation.
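For what it's worth, here is a rough sketch of what "multiple heads in parallel" means mechanically. It assumes random, untrained projection matrices and made-up sizes, so it says nothing about what trained heads actually learn. It only shows that each head has its own projections and therefore its own attention pattern over the same tokens. Note that nothing in this setup ranks the heads: they are computed side by side, so any "n-th order" ordering would have to emerge from training rather than from the architecture.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax where position i may only see positions 0..i."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def one_head(x, w_q, w_k, w_v):
    """Return (attention pattern, output) for a single head."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    pattern = causal_softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return pattern, pattern @ v

rng = np.random.default_rng(0)
T, d_model, n_heads = 5, 16, 2       # hypothetical sizes
d_head = d_model // n_heads
x = rng.normal(size=(T, d_model))    # stand-in for token embeddings

patterns, outputs = [], []
for _ in range(n_heads):
    # Each head gets its own (here random, untrained) Q, K and V projections.
    w_q, w_k, w_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_head))
                     for _ in range(3))
    pattern, out = one_head(x, w_q, w_k, w_v)
    patterns.append(pattern)
    outputs.append(out)

combined = np.concatenate(outputs, axis=-1)  # head outputs are concatenated side by side
```

In a real transformer the concatenated result is then passed through one more learned linear layer, which is left out here.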
As to why K, Q and V are simply linear transformations of the input: I'd guess it's the simplest way (computationally, during training) that has enough expressive power to represent cross-token, directed relevance.
ChatGPT's response:
A: Yes, the query (Q), key (K), and value (V) vectors in the self-attention
mechanism are indeed linear transformations of the input, and there are several
reasons for this design choice:
[...]
2. Sufficient Expressive Power: Despite their simplicity, linear transformations
can be very powerful. They can project the input data into higher-dimensional spaces
(or compress it into lower dimensions), where the relationships between different
tokens can be more easily captured. This ability to reshape the representation
space is crucial for capturing complex patterns in data.
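One way to see the "directed" part: with separate linear maps for queries and keys, the score of token i looking at token j generally differs from the score of j looking at i. The sketch below assumes random matrices and made-up sizes (`w_q`, `w_k` and the dimensions are illustrative names, not from any particular model). It also checks that the whole score table equals a single bilinear form x (W_Q W_K^T) x^T, and that reusing one projection for both roles would make relevance symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 3, 6, 4            # hypothetical sizes; d_head < d_model compresses
x = rng.normal(size=(T, d_model))       # stand-in token embeddings
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))

q = x @ w_q                             # queries: linear map of the input
k = x @ w_k                             # keys: a different linear map of the same input

scores = q @ k.T                        # scores[i, j]: token i's query against token j's key
bilinear = x @ (w_q @ w_k.T) @ x.T      # the same numbers written as one bilinear form
shared = (x @ w_q) @ (x @ w_q).T        # what you get if queries and keys share one map

print(np.allclose(scores, bilinear))    # same table either way
print(np.allclose(shared, shared.T))    # a shared projection gives symmetric scores
print(np.allclose(scores, scores.T))    # separate projections generally do not
```

So two cheap matrix multiplications are already enough to express "i cares about j more than j cares about i", which fits the guess that linear maps are the simplest thing with enough expressive power for this.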
I'm still suffering mightily trying to understand the real meaning behind K, Q and V though! I'm running the example:
I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.
through ChatGPT, and it seems to completely agree with all my questions while at the same time slipping away from details into conclusions and platitudes.