Comment by krackers
8 hours ago
> But how do you get 'Paris' into the value vector in that case?
Ok wait, I think I see what you mean. Although maybe it's not getting Paris _into_ the value vector that's hard, but isolating the residual stream to _only_ that, rather than to other capitals along with it.
So as a naive example: maybe at the very first layer consuming your tokens, Q{France} would have a high inner product with K{capital}, and so our residual would now mostly contain V{capital}, which maybe contains embeddings of the capitals of all countries. You need some way to filter out all the other stuff, but you can't do that without an FFN + activation.
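Something like this toy numpy sketch of a single attention head (every vector here is invented for illustration, not taken from any real model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy single-head attention: the "France" query attends over the
# sequence [France, capital]. All embeddings are made up.
d = 4
q_france  = np.array([1.0, 0.0, 0.0, 0.0])  # Q{France}
k_france  = np.array([0.0, 1.0, 0.0, 0.0])  # K{France}: low <q,k>
k_capital = np.array([1.0, 0.0, 0.0, 0.0])  # K{capital}: high <q,k>

# Pretend V{capital} smears together several capitals, not just Paris:
v_capital = np.array([0.5, 0.3, 0.2, 0.0])  # mix of <paris>,<tokyo>,<dc>
v_france  = np.zeros(d)

scores = np.array([q_france @ k_france, q_france @ k_capital])
attn   = softmax(scores / np.sqrt(d))

# Attention output is a convex combination of value vectors, so the
# residual picks up *all* the capitals baked into V{capital} at once.
residual_update = attn[0] * v_france + attn[1] * v_capital
print(residual_update)
```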
Just throwing in a ReLU by itself won't help, since it still acts on each element independently; you need some way to put weight on "Paris" while suppressing the others, i.e. mixing within the residual stream itself.
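To make that concrete, here's a toy contrast between a bare ReLU and an FFN whose weights I've hand-picked (not learned) to detect and boost the "paris" component:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0)

# Residual slice holding <paris>, <tokyo>, <dc> components:
x = np.array([0.5, 0.3, 0.2])

# A bare ReLU is elementwise: every positive entry passes through
# unchanged, so the other capitals are not suppressed at all.
print(relu(x))  # [0.5 0.3 0.2]

# An FFN mixes dimensions before the nonlinearity. These weights are
# invented for illustration: a detector that fires only when the
# "paris" component dominates, then a write-back that amplifies it.
W1 = np.array([[1.0, -0.5, -0.5]])          # "paris minus the rest"
W2 = np.array([[10.0], [-10.0], [-10.0]])   # boost paris, suppress others
h  = relu(W1 @ x)                           # scalar "is it paris?" feature
print(x + W2 @ h)                           # paris up, tokyo/dc pushed down
```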
Although maybe if you really stretch it, somewhere in a deeper layer you could have 1-hot encoded values with a "gain" coefficient, so that when you do the residual addition it's something like {<paris>, <tokyo>, <dc>} + 10000*{<1>, <0>, <0>}, and then if you softmax that you get something with most of its mass on "Paris". But it seems like this would not be practical, or it's just shifting the issue to how the right 1-hot vector gets chosen.
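The stretch version as a toy calculation (the 1-hot selector and the gain are both made up, and this deliberately ignores where that selector would come from):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits over {paris, tokyo, dc} as they might sit in the residual stream:
capitals = np.array([0.5, 0.3, 0.2])

# Add a huge gain on a 1-hot selector before the softmax:
one_hot_paris = np.array([1.0, 0.0, 0.0])
logits = capitals + 10_000 * one_hot_paris

print(softmax(logits))  # ~[1, 0, 0]: essentially all mass on "Paris"
```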