Comment by D-Machine

3 days ago

Very much this. Cross-attention and the x, y notation make the similarity / covariance matrix far clearer and more intuitive.

Also, forget the terms "query", "key" and "value", and the vague analogies to key-value stores. That is IMO a largely false analogy, and certainly not a helpful way to understand what is happening.

100% agreed. Attention finally clicked for me when I realized "wait, it's just a transformed, weighted dot product and has nothing to do with key/value lookups." I would have gotten this a lot faster had they called the key matrix \Sigma.
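
For anyone who wants to see it concretely, here is a rough pure-Python sketch of single-head cross-attention written in the x, y notation. The names (cross_attention, w_q, w_k, w_v, sigma) are my own for illustration, not from any library, and I'm assuming the usual scaled dot-product form with a softmax over each row. The point is that the scores are just x * Sigma * y^T with Sigma = W_q * W_k^T, i.e. a transformed dot product between rows of x and rows of y.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scores_via_projections(x, y, w_q, w_k):
    """Textbook form: project x and y separately, then take dot products."""
    q = matmul(x, w_q)
    k = matmul(y, w_k)
    return matmul(q, transpose(k))


def scores_via_sigma(x, y, w_q, w_k):
    """Same scores as one bilinear form: x * Sigma * y^T."""
    sigma = matmul(w_q, transpose(w_k))  # hypothetical name, per the comment above
    return matmul(matmul(x, sigma), transpose(y))


def cross_attention(x, y, w_q, w_k, w_v):
    """Each row of x gets a weighted average of the transformed rows of y."""
    d_k = len(w_k[0])
    raw = scores_via_sigma(x, y, w_q, w_k)
    scaled = [[v / math.sqrt(d_k) for v in row] for row in raw]
    weights = softmax_rows(scaled)
    return weights, matmul(weights, matmul(y, w_v))


# Toy inputs (made up): x has 2 rows, y has 3 rows, feature size 2.
x = [[1.0, 0.0], [0.5, -1.0]]
y = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]
w_q = [[1.0, 0.5], [0.0, 1.0]]
w_k = [[0.5, 0.0], [1.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]

weights, out = cross_attention(x, y, w_q, w_k, w_v)
```

Here weights is a 2 x 3 matrix whose rows each sum to 1, and out is 2 x 2. Nothing in it looks anything up: every row of y contributes to every output row, just with a different weight.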