Comment by crystal_revenge
5 hours ago
What you're missing is that there's no need to do extra work in the kernel smoothing step (what attention essentially is) because all the fancy transformation work is already happening in learning the kernel.
The feedforward networks prior to the attention layer are effectively learning sophisticated kernels. If you're unfamiliar: a kernel is just a generalization of the dot product, which is the most fundamental way of defining "similarity" between two points.
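To make the "generalized dot product" idea concrete, here's a tiny sketch (my own illustration, not from the comment): the quadratic kernel (x·y)² is exactly a plain dot product after a feature map `phi` that expands a vector into all its pairwise products.

```python
import numpy as np

# A kernel generalizes the dot product: k(x, y) = phi(x) . phi(y)
# for some feature map phi. Example: the quadratic kernel (x . y)^2
# is an ordinary dot product in the space of pairwise products.
def phi(x):
    # hypothetical feature map: all pairwise products x_i * x_j
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

# (x . y)^2 = 5^2 = 25, and phi(x) . phi(y) gives the same number
assert np.isclose((x @ y) ** 2, phi(x) @ phi(y))
```

The point is that choosing (or learning) a kernel is the same thing as choosing a representation in which plain dot-product similarity is the right notion of "similar".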
By learning a kernel, the transformer is learning the best way to define what "similar" means for the task at hand, and then we simply apply some basic smoothing over the data. This handles all sorts of interesting ways to compare points, and that comparison lets every point contribute a little bit of information.
Anything you could hope to achieve by performing more comparisons would be better solved by a better similarity function.
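The whole "attention is kernel smoothing" view fits in a few lines. This is a minimal sketch (names like `W_q`/`W_k` are my placeholders for the learned projections, not anyone's actual code): a learned similarity scores every pair of points, softmax turns the scores into weights, and the output is just a weighted average — classic Nadaraya–Watson smoothing.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kernel_smooth(queries, keys, values, W_q=None, W_k=None):
    # W_q / W_k stand in for the learned maps that define the kernel;
    # with them set to None this is plain dot-product attention.
    q = queries @ W_q if W_q is not None else queries
    k = keys @ W_k if W_k is not None else keys
    # learned similarity (scaled dot product), normalized into weights
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    # "basic smoothing": each output is a convex combination of the values
    return weights @ values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = kernel_smooth(x, x, x)  # self-attention over 5 points
```

All the expressive power lives in how `q @ k.T` is computed; the smoothing step itself never needs to get fancier.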