Comment by antoineMoPa
2 months ago
What I don't get about attention is why it would be necessary when a fully connected layer can also "attend" to all of the input. With very small datasets (think 0–500 tokens), I found that attention makes training take longer and gives worse results. I guess the benefits show up with much larger datasets. Note that I'm an AI noob just doing some personal AI projects, so I'm not exactly a reference.
A fully connected layer has different weights for each feature (or, in your formulation, each position in the input). So the word "hello" would be treated completely differently if it appeared in position 15 vs. position 16, for example.
Attention, by contrast, treats those two occurrences similarly; the only difference comes from the positional encoding, so the model can learn generalized patterns more easily.
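To make that contrast concrete, here's a minimal PyTorch sketch of my own (the sizes are arbitrary toy values, not anything from the thread): the dense layer ties a separate weight to every (position, feature) pair, while attention reuses the same projections at every position.

```python
import torch
import torch.nn as nn

d_model, seq_len = 64, 16

# Fully connected layer over the flattened sequence: one weight per
# (position, feature) pair, so the same token at position 15 vs. 16
# is handled by entirely different parameters.
dense = nn.Linear(seq_len * d_model, seq_len * d_model)

# Self-attention: one shared set of Q/K/V/output projections applied at
# every position; positions are distinguished only by positional encoding.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

x = torch.randn(1, seq_len, d_model)                    # (batch, positions, features)
y_dense = dense(x.flatten(1)).view(1, seq_len, d_model)
y_attn, _ = attn(x, x, x)

print(sum(p.numel() for p in dense.parameters()))  # 1,049,600
print(sum(p.numel() for p in attn.parameters()))   # 16,640
```

Even at these tiny sizes the dense layer needs roughly 60x more parameters for the same input, and every one of them is tied to an absolute position.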
I think that this is the explanation I needed, thanks!
This is the case with most clever neural architectures: in theory, you could always replace them with dense layers that would perform better given enough resources and training. But that's just it: efficiency matters (number of parameters, training data, training time, FLOPs), and dense layers aren't nearly as efficient (to put it mildly).
You've seen this play out at a small scale, but if you calculate the size of the dense layers needed to even theoretically replicate a big attention layer (or even a convolution), to say nothing of the data required to train them without the help of the architecture's inductive bias, you'll see that the clever architectures are quite necessary at scale.
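For a rough sense of that calculation, here's a back-of-the-envelope sketch (my own numbers, assuming GPT-2-ish sizes of d_model = 768 and a 1024-token context, and a hypothetical dense "equivalent" that maps the whole flattened sequence to itself):

```python
# Rough single-layer parameter counts at assumed GPT-2-ish sizes.
d_model, seq_len = 768, 1024

attn_params  = 4 * d_model * d_model     # Q, K, V and output projections
dense_params = (seq_len * d_model) ** 2  # one weight per (input, output) pair

print(f"attention layer: {attn_params:,}")   # 2,359,296          (~2.4M)
print(f"dense layer:     {dense_params:,}")  # 618,475,290,624    (~618B)
```

That's a factor of roughly 250,000x for a single layer, before even considering the extra data you'd need to train it without the weight sharing that attention gives you for free.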
Attention also grows dynamically with the input size; MLPs don't.
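A quick sketch of that point as well (again PyTorch, an illustrative example of my own): the same attention module handles any sequence length with a fixed parameter count, while a dense layer over the flattened sequence is wired to one specific input size.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# The same attention module processes any sequence length; only the
# computation scales with the input, not the parameter count.
for seq_len in (10, 100, 1000):
    x = torch.randn(1, seq_len, 64)
    out, _ = attn(x, x, x)
    print(seq_len, tuple(out.shape))

# A dense layer over the flattened sequence is tied to one length only;
# feeding it a longer input fails with a shape mismatch.
mlp = nn.Linear(10 * 64, 10 * 64)
# mlp(torch.randn(1, 100 * 64))  # RuntimeError: shapes cannot be multiplied
```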