Comment by eurekin

2 years ago

I'm really speculating here!

I think 3 is a good fit, since:

K and Q together (via their dot products) simply represent pairwise importance between tokens (using the token + positional embedding representation).

V lifts that up one "abstraction level": K and Q alone wouldn't be able to differentiate the following:

  I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.

Somehow, the V feature matrix might contain one "filter" for the concept of a river bank and a second "filter" for the money bank.
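
To make the mechanics a bit more concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. Everything specific in it is my own assumption for illustration: the toy sizes, the random untrained weights, and the helper names (`matmul`, `softmax`, `attention`, `rand_matrix`). It only shows that Q and K produce a token-by-token weight matrix and that V is what gets mixed by those weights; it does not demonstrate the "bank" disambiguation itself.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Q K^T / sqrt(d_k): one score per (query token, key token) pair.
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Softmax turns each row of scores into weights that sum to 1.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of the rows of V.
    return weights, matmul(weights, v)


def rand_matrix(rows, cols):
    """Hypothetical helper: a matrix of standard-normal random numbers."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, d_model, d_head = 5, 8, 4  # toy sizes, chosen arbitrarily

x = rand_matrix(n_tokens, d_model)  # stand-in for token + positional embeddings
w_q, w_k, w_v = (rand_matrix(d_model, d_head) for _ in range(3))  # untrained

weights, out = attention(x, w_q, w_k, w_v)
print(len(weights), len(weights[0]))  # pairwise weights: n_tokens x n_tokens
print(len(out), len(out[0]))          # mixed values: n_tokens x d_head
```

In a trained model the three weight matrices would be learned, and whatever "filters" V ends up holding would come out of that training rather than being set by hand.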

I'm only just starting to learn this, so I might be terribly wrong here :)
