Comment by seydor
2 years ago
And what do the different heads represent? Why are query, key, and values simply linear transforms of the input?
2 years ago
Those are such great questions. I'm also trying to find out, and my current notes are as follows.
Attention matrix, from the LLM visualization at bbycroft.net:
It seems that an attention head (matrix) is a set of responses (one for each token) to the question: looking only at past tokens, which are most relevant when considering this one?
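One way to make that reading concrete is a minimal sketch in plain Python, assuming the usual causal (decoder-style) masking: row i of the attention matrix is a softmax over the raw scores for tokens 0..i, and every later token gets weight 0. The function name and the score values are made up for illustration; none of this is taken from the bbycroft.net code.

```python
import math

def causal_attention_weights(scores):
    """Row i: softmax over scores[i][0..i]; future positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # token i may only look at tokens 0..i
        m = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)  # weights for future tokens
        weights.append([e / total for e in exps] + masked)
    return weights

# Hypothetical raw relevance scores for 3 tokens (made-up numbers).
# The large values above the diagonal are ignored because of the mask.
scores = [
    [0.0, 9.0, 9.0],
    [1.0, 1.0, 9.0],
    [0.0, 2.0, 0.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and is that token's answer to "which of the tokens so far matter most to me?". The first token can only attend to itself, so its row puts all the weight on position 0.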
Huge speculation: it might be that one head finds, let's say, first-order importance relations, i.e. the most important meaning. Adding a second head might then allow finding n-th order importance: more subtle or nuanced considerations.
Or, simply, in the course of ablation it has been found that more heads are better than one :)
I asked ChatGPT about that:
Which seems to support that "n-th order of importance" interpretation.
As to why K, Q and V are simply linear transformations of the input - I'd guess it's the simplest way (computationally, while learning) that has enough expressive power to represent cross-token, directed relevancy.
ChatGPT response:
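A minimal sketch of the "directed" part, in plain Python with hand-picked toy numbers (the matrices W_q, W_k, W_v and the two token vectors are assumptions for illustration, not learned weights): because queries and keys come from different linear maps, the score for token 0 looking at token 1 differs from the score for token 1 looking at token 0. If both sides shared a single projection, the score would be symmetric and could not express direction.

```python
def linear(x, W):
    """Project vector x (length d_in) with matrix W (d_in x d_out), no bias."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Toy projection matrices (hypothetical values, d_model = d_head = 2).
W_q = [[1.0, 1.0], [0.0, 1.0]]
W_k = [[0.0, 2.0], [1.0, 0.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]

# Two toy token embeddings.
x = [[1.0, 2.0], [3.0, 1.0]]

Q = [linear(t, W_q) for t in x]  # what each token is looking for
K = [linear(t, W_k) for t in x]  # what each token offers to be matched on
V = [linear(t, W_v) for t in x]  # what each token passes along once selected

score_01 = dot(Q[0], K[1])  # token 0 asking about token 1
score_10 = dot(Q[1], K[0])  # token 1 asking about token 0
print(score_01, score_10)   # different values: relevance is directed

# With one shared projection on both sides, direction is lost.
S = [linear(t, W_k) for t in x]
shared_01 = dot(S[0], S[1])
shared_10 = dot(S[1], S[0])
print(shared_01, shared_10)  # identical by symmetry of the dot product
```

In this sketch the raw score is just q_i · k_j; real implementations typically also scale by the square root of the head dimension and apply a softmax, which does not change the asymmetry argument.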
I'm still suffering mightily trying to understand the real meaning behind K, Q and V though! I'm running the example:
through ChatGPT, and it seems to agree completely with everything I ask, while at the same time slipping away from details into conclusions and platitudes.