Comment by eurekin

2 years ago

Those are such great questions. I'm also trying to find out, and my current notes are as follows.

On the attention matrix, from the LLM visualization at bbycroft.net:

  So the main goal of self-attention is that each column wants to find relevant
  information from other columns and extract their values, and does so by
  comparing its query vector to the keys of those other columns. With the added
  restriction that it can only look in the past.

It seems that an attention head (matrix) is a set of responses (one for each token) to the question: looking only at past tokens, which are most relevant when considering this one?
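To make that reading concrete, here is a minimal sketch in plain Python of how one head's weight matrix could be computed from query and key vectors. The vectors are made-up toy values, not taken from any real model, and the helper names (`causal_attention_weights`, `dot`, `softmax`) are hypothetical. Row t is token t's answer to "which tokens so far are relevant to me?". Note that in the usual causal setup a token can also attend to itself, not only to strictly earlier tokens.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention_weights(queries, keys):
    """Row t holds token t's weights over tokens 0..t; later tokens get 0."""
    d = len(queries[0])
    weights = []
    for t, q in enumerate(queries):
        # Compare this token's query with the keys of itself and earlier tokens.
        scores = [dot(q, keys[s]) / math.sqrt(d) for s in range(t + 1)]
        # Pad with zeros: the causal mask hides everything after position t.
        weights.append(softmax(scores) + [0.0] * (len(keys) - t - 1))
    return weights

# Toy query/key vectors for three tokens (illustrative assumption).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1, so it can be read as "how token t splits its attention" over the tokens it is allowed to see.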

Huge speculation: it might be that a single head finds, let's say, the first order of importance among relations, i.e. the most important meaning. Adding a second head might allow the model to find the n-th order of importance: more subtle or nuanced considerations.

Or, simply, in the course of ablation it was found that more heads are better than one :)

I asked ChatGPT about that:

  Q: Seems that a attention head (matrix) is a set of responses (for each token) 
  to the question: looking only at past tokens, which are most relevant, when 
  considering this one?

  A: Yes, your interpretation is a good way to understand the role of an attention
  head in the context of models like GPT (Generative Pre-trained Transformer), 
  which use a causal or masked self-attention mechanism. Each attention head 
  effectively answers the question: "Given the current token, which of the 
  preceding tokens (including itself) are most relevant?" Here's a breakdown 
  of this process:

  [...]

  Multiple Attention Heads: It's important to note that modern transformer models 
  use multiple attention heads in parallel for each token. Each head can potentially
  focus on different aspects or patterns in the sequence, allowing the model to
  capture a richer and more nuanced understanding of the context.

Which seems to support that "n-th order of importance" interpretation.
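A toy illustration of "each head can focus on different patterns", assuming each head simply has its own query and key projection matrices. The matrices here (`identity`, `swap`) are hand-picked for the example, not learned, and `head_weights` is a hypothetical helper. The same three tokens produce different attention rows per head. This only shows that heads can differ; it says nothing about whether they rank by "order of importance".

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_weights(embeddings, W_q, W_k):
    """Causal attention weights for one head with its own projections."""
    Q = [matvec(x, W_q) for x in embeddings]
    K = [matvec(x, W_k) for x in embeddings]
    d = len(Q[0])
    rows = []
    for t in range(len(embeddings)):
        scores = [sum(a * b for a, b in zip(Q[t], K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        rows.append(softmax(scores) + [0.0] * (len(embeddings) - t - 1))
    return rows

# Three toy token embeddings; tokens 0 and 2 are identical on purpose.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # head A: keys are the raw embeddings
swap = [[0.0, 1.0], [1.0, 0.0]]      # head B: keys have their two features swapped

head_a = head_weights(embeddings, identity, identity)
head_b = head_weights(embeddings, identity, swap)

# Same last token, two different answers to "what is relevant to me?".
print("head A, last token:", [round(w, 3) for w in head_a[2]])
print("head B, last token:", [round(w, 3) for w in head_b[2]])
```

In this setup head A's last token favours the tokens that look like itself, while head B's favours the middle token. In a real model the per-head outputs are then concatenated and mixed by another learned matrix, which is left out here.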

As to why K, Q and V are simply linear transformations of the input: I'd guess it's the simplest way (computationally, during training) that has enough expressive power to represent directed, cross-token relevance.

ChatGPT response:

  A: Yes, the query (Q), key (K), and value (V) vectors in the self-attention
  mechanism are indeed linear transformations of the input, and there are several
  reasons for this design choice:

  [...]

  2. Sufficient Expressive Power: Despite their simplicity, linear transformations
  can be very powerful. They can project the input data into higher-dimensional spaces
  (or compress it into lower dimensions), where the relationships between different
  tokens can be more easily captured. This ability to reshape the representation
  space is crucial for capturing complex patterns in data.
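For what it's worth, "linear transformations of the input" can be sketched in a few lines of plain Python: each token embedding is multiplied by three separate matrices to get its query, key and value, and the output for a token is a weighted sum of the values it is allowed to see. The matrices `W_q`, `W_k`, `W_v` below are arbitrary toy numbers standing in for learned weights, and `attend` is a hypothetical helper name.

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(embeddings, W_q, W_k, W_v):
    # Q, K and V are each just "embedding times matrix": nothing nonlinear yet.
    Q = [matvec(x, W_q) for x in embeddings]
    K = [matvec(x, W_k) for x in embeddings]
    V = [matvec(x, W_v) for x in embeddings]
    d = len(Q[0])
    outputs = []
    for t in range(len(embeddings)):
        # Query of token t against keys of tokens 0..t (causal restriction).
        scores = [sum(a * b for a, b in zip(Q[t], K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        # Output = relevance-weighted mix of the visible tokens' values.
        outputs.append([sum(w[s] * V[s][j] for s in range(t + 1))
                        for j in range(len(V[0]))])
    return outputs

# Toy embeddings and toy "learned" matrices (all values are assumptions).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 2.0]]
W_v = [[1.0, 2.0], [3.0, 4.0]]

outputs = attend(embeddings, W_q, W_k, W_v)
for out in outputs:
    print([round(v, 3) for v in out])
```

The nonlinearity comes only from the softmax over the query-key scores. One way to read the roles: Q is what a token is looking for, K is what a token advertises about itself, and V is what it actually hands over when attended to.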

I'm still suffering mightily trying to understand the real meaning behind K, Q and V though! I'm running the example:

  I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.

through ChatGPT, and it seems to agree completely with everything I ask, while at the same time slipping away from details into conclusions and platitudes.