mirekrusin 2 years ago
Did somebody try 2 or 5 instead of 3?
eurekin 2 years ago
I'm really speculating here!

I think 3 is a good fit, since:

K and Q (both together) simply represent pairwise importance between tokens (token plus positional embedding representation).

V lifts that up one "abstraction level": K and Q alone wouldn't be able to differentiate the two uses of "bank" in the following: I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.

Somehow, the V feature matrix might contain one "filter" for the concept of a river bank and a second "filter" for the money bank.

I'm only just starting to learn this, so I might be terribly wrong here :)
naveen99 2 years ago
They just clone it a few times and call it multi-head attention. 2 doesn’t really make sense. 3 is the right number.
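For anyone following along, here is a minimal sketch of how the three projections are usually combined in single-head scaled dot-product attention. It is pure Python with toy values. The names `attention`, `w_q`, `w_k`, `w_v`, `w_v_scaled` and all the numbers are illustrative assumptions, not taken from this thread or from any particular library. The point it tries to show: Q and K only decide how strongly each token attends to each other token, while V decides what content actually gets mixed together.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over token vectors x.

    Returns (weights, output): weights[i][j] is how much token i attends
    to token j, and output[i] is the weighted mix of value vectors.
    """
    q = matmul(x, w_q)  # queries: what each token is looking for
    k = matmul(x, w_k)  # keys: what each token offers for matching
    v = matmul(x, w_v)  # values: the content that gets passed along
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)  # pairwise query/key dot products
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, v)


# Toy input: three "tokens" with two features each (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_v_scaled = [[2.0, 0.0], [0.0, 2.0]]  # hypothetical alternative value projection

weights, out = attention(x, identity, identity, identity)
weights_2, out_2 = attention(x, identity, identity, w_v_scaled)

print("same attention weights:", weights == weights_2)
print("same output:", out == out_2)
```

Swapping only the value projection leaves the attention weights untouched but changes the output, which is one way to see why V is kept separate from Q and K. Multi-head attention runs several such (Q, K, V) projection triples in parallel and concatenates their outputs.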