Comment by getnormality
15 hours ago
> So far it seems to me that self-attention really brought new capabilities to a network
Do we have a layman explanation for what makes self-attention so uniquely powerful? Something more than "it lets you do self-attention".
Computational power. Without self-attention, you have a sloppy implementation of something called a PDA (pushdown automaton), like an old HP calculator. With it, you have an even sloppier implementation of a Turing machine.
So (modulo a _lot_ of details) it increases the power from that of a "calculator" to that of a "computer".
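To make the "calculator" vs. "computer" gap concrete, here is a rough sketch in plain Python. It has nothing to do with how a network actually computes, and the function names are made up for illustration. The first checker only ever touches the top of one stack, which is the kind of memory a PDA has. The second recognizes a^n b^n c^n, a standard example of a language that no PDA can recognize, and it does so by freely re-reading the input.

```python
def balanced(s: str) -> bool:
    """PDA-style check: one left-to-right pass, memory is a single stack."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)          # push an opener
        elif ch in pairs:
            # pop and compare; only the top of the stack is ever visible
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                  # accept only if nothing is left open


def is_anbncn(s: str) -> bool:
    """Recognizes a^n b^n c^n (n >= 0), which is not context-free.

    Assumption for illustration: we allow ourselves to measure and
    re-read the whole input, which a single stack does not permit.
    """
    n, rem = divmod(len(s), 3)
    return rem == 0 and s == "a" * n + "b" * n + "c" * n


if __name__ == "__main__":
    print(balanced("([])"), balanced("([)]"))
    print(is_anbncn("aabbcc"), is_anbncn("aabbc"))
```

The point is the memory access pattern, not the specific languages: push and pop only on one side, versus being able to look anywhere you like.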