Comment by mirekrusin

2 years ago

Has anybody tried 2 or 5 instead of 3?

I'm really speculating here!

I think 3 is a good fit, since:

K and Q (both together) simply represent pairwise importance between tokens (using the token + positional embedding representation)

V lifts that up one "abstraction level": K+Q alone wouldn't be able to differentiate the following:

  I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.

Somehow, the V feature matrix might contain one "filter" for the concept of a river bank and a second "filter" for the money bank.
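
To make that split concrete, here is a rough sketch of a single attention head in plain Python. The toy input and the weight matrices (`x`, `w_q`, `w_k`, `w_v`) are numbers I made up for illustration, not values from any trained model. It only shows the mechanics: Q and K alone decide the pairwise weights, and V decides what content gets mixed using those weights.

```python
import math

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the max for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))            # pairwise token-to-token scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, v)          # V supplies the mixed content

# Toy input (assumed): 3 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.5, -0.5], [0.25, 0.75]]

weights, out = attention(x, w_q, w_k, w_v)
```

Swapping in a different `w_v` leaves `weights` untouched and only changes `out`, which is roughly the separation of roles I'm guessing at above.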

I'm only just starting to learn this, so I might be terribly wrong here :)

They just clone it a few times and call it multi-head attention. 2 doesn’t really make sense. 3 is the right number.
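
A minimal sketch of the "clone it a few times" idea, in plain Python with random toy weights (the sizes `d_model`, `d_head`, `n_heads` and the helper names are my own assumptions). Each head gets its own Q, K and V projections and the per-token outputs are concatenated. Real implementations usually also apply a final output projection, which is left out here.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def head(x, w_q, w_k, w_v):
    # One attention head: the same three projections as before.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(c) for c in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads):
    # Run every head on the same input, then concatenate per token.
    outputs = [head(x, *h) for h in heads]
    return [sum((o[i] for o in outputs), []) for i in range(len(x))]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)                  # fixed seed so the toy run is repeatable
d_model, d_head, n_heads = 4, 2, 2      # assumed toy sizes
x = rand_matrix(3, d_model, rng)        # 3 tokens
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]

out = multi_head(x, heads)              # 3 tokens x (n_heads * d_head) features
```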