Comment by andersource

2 months ago

A fully connected layer has a different weight for each feature (or each input position, in your formulation). So the word "hello" would be processed by entirely different parameters if it appeared at position 15 vs. 16, for example.

Attention, by contrast, treats those two occurrences the same way, with the only difference coming from the positional encoding - so it's much easier to learn patterns that generalize across positions.
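
To make this concrete, here's a minimal numpy sketch (all names and shapes are made up for illustration). A dense layer over the flattened sequence gives unrelated outputs when the same embedding shifts by one position, because it hits completely different weight rows; attention with shared projections (positional encoding omitted here, to isolate the weight-sharing effect) just produces a shifted version of the same output.

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model = 20, 8

    # Fully connected over the flattened input:
    # a separate weight per (position, feature) pair.
    W_fc = rng.normal(size=(seq_len * d_model, d_model))

    # Attention projections: the same weights apply at every position.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    def fc(x):  # x: (seq_len, d_model)
        return x.reshape(-1) @ W_fc

    def attention(x):  # plain scaled dot-product attention, no positional encoding
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        scores = q @ k.T / np.sqrt(d_model)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        return w @ v

    # The same "hello" embedding placed at position 15 vs. 16.
    hello = rng.normal(size=d_model)
    x_a = np.zeros((seq_len, d_model)); x_a[15] = hello
    x_b = np.zeros((seq_len, d_model)); x_b[16] = hello

    print(np.allclose(fc(x_a), fc(x_b)))                        # False: disjoint weights
    print(np.allclose(attention(x_a)[15], attention(x_b)[16]))  # True: same computation, shifted

A real transformer then adds positional encodings on top of this position-invariant computation, so position influences the result without each position needing its own weights.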