Comment by hollosi
13 hours ago
I would not be surprised if it turned out that the exact attention mechanism does not really matter, much as with the move from sigmoid to ReLU to GELU, and that only the speed of calculation does - and QKV is pretty good at that on GPUs.
This has been my thought for a long time. I think all that matters from attention is that there is crosswise comparison going on.
You need some amount of parallel compute and some amount of global comparison.
And the rest is basically ways to add parameters and scale.
(This is in theory. In practice, the algorithmic details of model architecture can give you a lot of small-percentage stability and efficiency improvements that really compound.)
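To make the "crosswise comparison" point concrete, here is a rough sketch in plain Python, not anyone's actual implementation. It runs the same all-pairs compare, normalize, and mix loop with two different scoring rules: the usual scaled dot product, and a negative squared distance swapped in purely for illustration. The names `attend`, `dot_product_scores`, and `neg_distance_scores` are made up for this example, and the learned Q/K/V projections are left out for brevity, so the raw token vectors stand in for queries, keys, and values.

```python
import math


def softmax(row):
    """Normalize one row of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_scores(queries, keys):
    """Standard scaled dot-product comparison of every query with every key."""
    scale = math.sqrt(len(keys[0]))
    return [[sum(a * b for a, b in zip(q, k)) / scale for k in keys]
            for q in queries]


def neg_distance_scores(queries, keys):
    """Alternative comparison (illustrative assumption): negative squared distance."""
    return [[-sum((a - b) ** 2 for a, b in zip(q, k)) for k in keys]
            for q in queries]


def attend(queries, keys, values, score_fn):
    """Global comparison followed by a weighted mix of the values.

    Only score_fn changes between variants; the all-pairs structure stays.
    """
    scores = score_fn(queries, keys)            # n x n pairwise comparison
    weights = [softmax(row) for row in scores]  # each row sums to 1
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]


# Three toy "tokens"; no learned projections, so Q = K = V = tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_dot = attend(tokens, tokens, tokens, dot_product_scores)
out_dist = attend(tokens, tokens, tokens, neg_distance_scores)
```

Both variants produce one mixed vector per token, built from every other token. Whether the two scoring rules would train to similar quality at scale is exactly the open question in the comment; this sketch only shows that the surrounding structure does not depend on the choice.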