Comment by eurekin
2 years ago
I'm really speculating here!
I think 3 is a good fit, since:
K and Q (both together) simply represent pairwise importance (between token + positional embedding representations)
V lifts that up one "abstraction level": K and Q alone wouldn't be able to differentiate the two uses of *bank* in the following:
I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.
Somehow, the V feature matrix might contain one "filter" for the concept of a river bank and a second "filter" for the money bank.
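Here's a toy sketch of what I mean, in plain Python. Everything in it is made up for illustration: the 2-d vectors, the "money"/"water" feature labels, and the assumption that the two occurrences of *bank* end up with different queries (say, because of position or earlier layers). It only shows the mechanics: Q and K produce the pairwise weights, and V is the content that actually gets mixed.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention for small lists of vectors.
    d = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights.append(softmax(scores))  # pairwise importance: Q and K only
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]  # content: weighted mix of the V rows
    return weights, outputs

# Hypothetical, hand-picked vectors (not learned, purely illustrative).
# Context tokens: "account", "river"
K = [[1.0, 0.0], [0.0, 1.0]]
# V features: [money-ness, water-ness]
V = [[1.0, 0.0], [0.0, 1.0]]
# Queries for the first and second "bank" (assumed to differ).
Q = [[4.0, 0.0], [0.0, 4.0]]

weights, outputs = attention(Q, K, V)
print(weights)  # who attends to whom
print(outputs)  # first "bank" leans toward money, second toward water
```

The weights alone only say "this bank looks mostly at account, that one mostly at river"; the different meanings only show up once those weights are applied to V.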
I'm only starting to learn this, so I might be terribly wrong here :)