Comment by naveen99

2 years ago

I just think of scaled dot-product attention as a generalized convolution mechanism. The query, key, value jargon is a little confusing: in self-attention all three are derived from the same signal and just multiplied with each other. Who knows why it works? And which hyperparameters are good for which data? What’s the ideal sequence size?
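To make the "all three come from the same signal" point concrete, here is a minimal sketch of single-head scaled dot-product self-attention. It uses plain Python lists to stay dependency-free; the helper names (`matmul`, `softmax`, `self_attention`, `random_matrix`) and the toy sizes are assumptions for illustration, not anything from a particular library. One way to read the convolution analogy: each output row is a weighted sum of the value rows, but the weights are computed from the input itself rather than being a fixed kernel.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k: (d_model, d_k); w_v: (d_model, d_v).
    Returns the output (seq_len, d_v) and the attention weights (seq_len, seq_len).
    """
    # Q, K and V are three different linear projections of the same input x.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # Score every query row against every key row, scaled by sqrt(d_k).
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]
    # Each row of weights is a probability distribution over positions.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value rows.
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned parameters or input embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_k = 4, 8, 3  # arbitrary toy sizes
x = random_matrix(seq_len, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_k, rng) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

In a trained model the three projection matrices are learned; here they are random, so only the shapes and the row-wise normalization of `weights` are meaningful.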

7 comments

adamnemecek 2 years ago

It's a convolution of a Hopf algebra.

esafak 2 years ago

I look forward to the day I can respond "Obviously".

adamnemecek 2 years ago

I have a Discord for this: https://discord.cofunctional.ai

mirekrusin 2 years ago

Has anybody tried 2 or 5 instead of 3?

eurekin 2 years ago

I'm really speculating here!

I think 3 is a good fit, since:

K and Q together simply represent pairwise importance between tokens (token plus positional embedding representations).

V lifts that up one "abstraction level": K and Q alone wouldn't be able to differentiate the two senses in the following:

I went to the *bank* and noticed my account was empty, so I went to a river *bank* and cried.

Somehow, the V feature matrix might contain one "filter" for the concept of a river bank and a second "filter" for the money bank.

I'm only just starting to learn this, so I might be terribly wrong here :)

naveen99 2 years ago

They just clone it a few times and call it multi-head attention. 2 doesn’t really make sense. 3 is the right number.
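A rough sketch of what "clone it a few times" can look like: several attention heads, each with its own Q, K and V projections, run on the same input and concatenated along the feature axis. The function names and toy sizes are assumptions for illustration, and the final output projection that usually follows the concatenation is left out to keep the sketch short.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One scaled dot-product self-attention head; returns (seq_len, d_head)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads):
    """Run every head on the same input and concatenate the results.

    heads: list of (w_q, w_k, w_v) tuples, one tuple per head.
    """
    outputs = [attention_head(x, *params) for params in heads]
    # For each position, join the per-head feature vectors end to end.
    return [[value for out in outputs for value in out[i]] for i in range(len(x))]


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned parameters or input embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_head, n_heads = 4, 8, 2, 3  # arbitrary toy sizes
x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
mh_out = multi_head_attention(x, heads)
```

The heads are not literal copies: they share the same structure but have separate projection matrices, so each one can weight the positions differently.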