
Comment by mnicky

3 days ago

IIRC, isn't the symmetry between Q and K also broken by the direction of the softmax? I mean, row-wise vs. column-wise application yields different interpretations.

Yes, but in practice: if you compute K = X·wk and Q = X·wq and then K·Qᵀ, you do three matrix multiplications. Wouldn't it be faster to compute W = wk·wqᵀ beforehand and then just do X·W·Xᵀ, which is only two matrix multiplications? Is there something I am missing?
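For concreteness, here is a minimal numpy sketch (with made-up toy dimensions) checking that the two orderings produce the same score matrix:

```python
import numpy as np

# Toy sizes (made up): n tokens, model dim d, head dim h.
n, d, h = 5, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
wk = rng.normal(size=(d, h))
wq = rng.normal(size=(d, h))

# Three multiplications: K = X·wk, Q = X·wq, then K·Qᵀ.
K, Q = X @ wk, X @ wq
scores_three = K @ Q.T

# Two multiplications at runtime, after precomputing
# W = wk·wqᵀ (a d×d matrix) once.
W = wk @ wq.T
scores_two = X @ W @ X.T

assert np.allclose(scores_three, scores_two)  # same result either way
```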

  • Most models have a per-head dimension much smaller than the input dimension, so it's faster to multiply by the small wk and wq individually than by the large matrix W (a rough multiply count, and a RoPE sketch, follow at the end of this thread). Also, if you use rotary positional embeddings, the RoPE rotation matrices need to be sandwiched in the middle, and they're different for every token, so you could no longer premultiply just once.

Oh yes! That's probably more important, in fact.

  • Well, I think this is also an answer to your question about the intuition.

    If the asymmetry between K and Q stems from the direction of the softmax application, it must also be the reason for the names of the matrices :)

    And if you think about it, it makes sense that for each Query, the weights over all of the Keys sum to 1 and not vice versa (the last sketch at the end of this thread shows both directions).

    So this is my only intuition for the K and Q names.

    (It may or may not be similar to the whole "db lookup thing"... I just don't use that one.)
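To put rough numbers on the first reply: with sequence length n, model dimension d and per-head dimension h, the separate route costs about 2ndh + n²h multiplies per head, while the premultiplied route costs about nd² + n²d, since W is d×d. A back-of-the-envelope sketch, with assumed (not measured) sizes:

```python
def mults_separate(n, d, h):
    # X @ wk and X @ wq: n·d·h each; K @ Q.T: n·n·h.
    return 2 * n * d * h + n * n * h

def mults_premultiplied(n, d, h):
    # X @ W: n·d·d; (X @ W) @ X.T: n·n·d  (W is d×d).
    return n * d * d + n * n * d

# Assumed sizes, just for scale: n=1024 tokens, d=4096, h=128.
print(mults_separate(1024, 4096, 128))       # ~1.2e9
print(mults_premultiplied(1024, 4096, 128))  # ~2.1e10, roughly 18x more
```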
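And a toy illustration of the RoPE point, using a hypothetical 2-dimensional head just to keep the rotation matrices simple: because each token gets its own rotation between the projections, no single precomputed W can reproduce the scores:

```python
import numpy as np

def rope(x, theta=0.5):
    # Minimal 2-D stand-in for RoPE: rotate token i's vector by i·theta.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        c, s = np.cos(i * theta), np.sin(i * theta)
        out[i] = np.array([[c, -s], [s, c]]) @ x[i]  # rotation differs per token
    return out

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
X = rng.normal(size=(n, d))
wk, wq = rng.normal(size=(d, h)), rng.normal(size=(d, h))

# The per-token rotations sit between wk and wq in the product...
scores_rope = rope(X @ wk) @ rope(X @ wq).T
# ...so folding wk·wqᵀ into one fixed W no longer gives the same scores.
assert not np.allclose(scores_rope, X @ (wk @ wq.T) @ X.T)
```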
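Finally, a small numpy sketch of the softmax-direction point: row-wise softmax (the standard choice) normalizes each Query's weights over the Keys, while column-wise would normalize each Key's weights over the Queries:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, h = 5, 4
Q, K = rng.normal(size=(n, h)), rng.normal(size=(n, h))
scores = Q @ K.T / np.sqrt(h)  # rows indexed by Queries, columns by Keys

A_row = softmax(scores, axis=-1)  # standard: each Query's row sums to 1
print(A_row.sum(axis=-1))         # -> [1. 1. 1. 1. 1.]

A_col = softmax(scores, axis=0)   # other direction: each Key's column sums to 1
print(A_col.sum(axis=0))          # -> [1. 1. 1. 1. 1.]
```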