
Comment by D-Machine

2 days ago

> I've never really got it and I just switched to thinking of QKV as a way to construct a fairly general series of linear algebra transformations on the input of a sequence of token embedding vectors x that is quadratic in x and ensures that every token can relate to every other token in the NxN attention matrix.

That's because what you say here is the correct understanding. The lookup thing is nonsense.

The terms "Query" and "Value" are largely arbitrary and meaningless in practice, look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x) or self_attention(x, x, y) in some cases (e.g. cross-attention), where x and y are are outputs from previous layers.

Plus, with different forms of attention (e.g. merged attention) and the research into why/how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" framing starts to look really bogus. Really, the attention layer allows for modeling correlations/similarities and/or multiplicative interactions among dimension-reduced representations. EDIT: Or, as you say, it can be regarded as kernel smoothing.
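For the kernel-smoothing reading, a rough sketch under my own naming: each output is a weighted average of the inputs, with the weights coming from a similarity kernel evaluated on a dimension-reduced representation. The exponential dot-product kernel below is just one choice (it recovers softmax attention); the function and variable names are illustrative:

```python
# Rough sketch of the kernel-smoothing view: each output is a weighted average
# of the inputs, with weights from a similarity kernel on a dimension-reduced
# representation. (Value projection omitted to keep the smoothing form obvious.)
import torch

def kernel_smooth(x, w_down, kernel):
    z = x @ w_down                              # dimension-reduced representation
    sim = kernel(z, z)                          # pairwise similarities, N x N
    weights = sim / sim.sum(dim=-1, keepdim=True)
    return weights @ x                          # smoothed (weighted-average) output

# An exponential dot-product kernel makes this exactly softmax attention;
# other kernels give other "forms of attention".
exp_dot = lambda a, b: torch.exp(a @ b.transpose(-2, -1))

x = torch.randn(10, 64)
w_down = torch.randn(64, 16) / 64 ** 0.5        # stand-in for a learned projection
out = kernel_smooth(x, w_down, exp_dot)         # shape (10, 64)
```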

Thanks! Good to know I’m not missing something here. And yeah, it’s always seemed better to me to frame it as: let’s find a mathematical structure that relates every embedding vector in a sequence to every other vector, and let’s throw in a bunch of linear projections so that there are lots of parameters to learn during training, letting the relationship structure model things from language, concepts, code, whatever.

I’ll have to read up on merged attention, I haven’t got that far yet!

  • The main takeaway is that "attention" is a much broader concept, so worrying too much about the "scaled dot-product attention" of transformers deeply limits your understanding of what kinds of things really matter in general.

    A paper I found particularly useful on this generalizes even further, noting the importance of multiplicative interactions more generally in deep learning (https://openreview.net/pdf?id=rylnK6VtDH).

    EDIT: Also, this paper I was looking for dramatically generalizes the notion of attention in a way I found to be quite helpful: https://arxiv.org/pdf/2111.07624