Comment by naveen99

2 years ago

I just think of scaled dot-product attention as a generalized convolution mechanism. The query, key, value jargon is a little confusing. In self-attention all three are derived from the same signal (each through its own learned projection) and just multiplied with each other. Who knows why it works. And which hyperparameters are good for which data? What’s the ideal sequence length?

Has anybody tried 2 or 5 instead of 3?
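For reference, here is a minimal sketch of the single-head scaled dot-product self-attention described above, in plain Python with no libraries. The function names (`self_attention`, `matmul`, `softmax`) and the toy identity weight matrices are assumptions made for illustration. In a real model `w_q`, `w_k` and `w_v` would be learned.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product: (n x m) times (m x p) -> (n x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Q, K and V all come from the same input x, via three separate projections.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Pairwise query-key scores, scaled by sqrt(d_k), then softmax per row.
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the value rows.
    return matmul(weights, v), weights

# Toy input: 3 tokens with 2 features each (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical stand-in for learned weights
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output token is a mixture of the value vectors, with the mixing proportions set by how well that token's query matches each key.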

  • I'm really speculating here!

    I think 3 is a good fit, since:

    K and Q together simply represent the pairwise importance of tokens (computed from their token + positional embedding representation).

    The V lifts that up one "abstraction level": K and Q alone wouldn't be able to differentiate the following:

      I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.
    

    Somehow, the V feature matrix might contain one "filter" for the concept of a river bank and a second "filter" for the money bank.

    I'm only just starting to learn this, so I might be terribly wrong here :)

  • They just clone it a few times and call it multi-head attention. 2 doesn’t really make sense. 3 is the right number.
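
On the multi-head point above, a rough sketch of what "cloning it a few times" looks like in plain Python. One caveat: in the usual formulation the heads are not identical copies. Each head gets its own projection matrices, and the head outputs are concatenated and passed through an output projection. The names (`multi_head`, `heads`, `w_o`) and the toy weights are assumptions for illustration only.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention for one head.
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    # heads is a list of (w_q, w_k, w_v) triples, one triple per head.
    outs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
            for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs feature-wise for every token.
    concat = [[val for o in outs for val in o[i]] for i in range(len(x))]
    # Final output projection mixes the heads back together.
    return matmul(concat, w_o)

# Toy setup: 3 tokens, 2 features, 2 heads of width 1 (made-up weights).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_a = ([[1.0], [0.0]],) * 3  # this head only looks at feature 0
head_b = ([[0.0], [1.0]],) * 3  # this head only looks at feature 1
w_o = [[1.0, 0.0], [0.0, 1.0]]
y = multi_head(x, [head_a, head_b], w_o)
```

Because each head has its own weights, each one can end up attending to a different relationship between tokens, which is the motivation for having several heads rather than one wide one.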