Comment by throw310822
2 days ago
I might be completely off base, but I can't help thinking of convolutions as my mental model for the K/Q/V mechanism. Attention has the same property as a convolution kernel of being trained independently of position; it learns how to translate a large, rolling portion of an input into a new "digested" value; and you can train multiple ones in parallel so that they learn to focus on different aspects of the input ("kernels" in the case of convolution, "heads" in the case of attention).
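A rough numpy sketch of what I mean (my own toy illustration, not from anything above): both a conv layer and an attention layer apply the same learned weights at every position, and you can run several kernels / heads in parallel over the same input.

    import numpy as np

    T, d, n_parallel = 8, 4, 2
    x = np.random.randn(T, d)

    # Two convolution kernels, each slid over every position of x.
    kernels = np.random.randn(n_parallel, 3, d)
    conv_out = np.stack([
        [(x[i:i+3] * k).sum() for i in range(T - 2)] for k in kernels
    ])                                              # (n_parallel, T-2)

    # Two attention heads, each with its own W_q/W_k/W_v, also applied at
    # every position (the parameters don't depend on the position i).
    heads = [{name: np.random.randn(d, d) for name in ("Wq", "Wk", "Wv")}
             for _ in range(n_parallel)]

    def head_out(h):
        Q, K, V = x @ h["Wq"], x @ h["Wk"], x @ h["Wv"]
        s = Q @ K.T / np.sqrt(d)
        A = np.exp(s - s.max(-1, keepdims=True))
        return (A / A.sum(-1, keepdims=True)) @ V   # (T, d)

    attn_out = np.stack([head_out(h) for h in heads])  # (n_parallel, T, d)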
The whole reason for the first "AI Winter" was that people were trying to solve problems with small neural nets, and of course you run into problems during training where you can't get things to converge.
Once compute became more available, you could have larger neural nets, and thus more dimensionality (in the sense of layer sizes). During training you then had more directions for gradient descent to move in, so things started happening with ML.
And all the architectures that you see today are basically simplifications of fully connected layers at maximum dimensionality. Any operation like attention, self-attention, or convolution can be unrolled into matrix multiplies.
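For what it's worth, here's a toy numpy sketch of the "unrolled into matrix multiplies" point (my own example, not a claim about how any hardware does it): a 1-D convolution written first as a sliding window, then as a single matrix multiply over stacked patches (the im2col trick).

    import numpy as np

    T, k = 10, 3
    x = np.random.randn(T)
    kernel = np.random.randn(k)

    # Direct sliding-window 1-D convolution ("valid", no padding).
    direct = np.array([x[i:i+k] @ kernel for i in range(T - k + 1)])

    # The same operation unrolled into one matrix multiply: stack every
    # window of x into a row of a patch matrix.
    patches = np.stack([x[i:i+k] for i in range(T - k + 1)])   # (T-k+1, k)
    unrolled = patches @ kernel

    assert np.allclose(direct, unrolled)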
I wouldn't be surprised if Google TPUs basically do this. It stands to reason that they are the most efficient because they don't move memory around, which means the matrix-multiply circuitry is hard-wired, which means the compiler basically has to lay out the data in the locations that are meant to be multiplied together, so the compiler probably does that unrolling under the hood.
I think there are two key differences though: 1) Attention doesn't use fixed, distance-dependent weights for the aggregation; instead the weights become "semantically dependent", based on the association between q and k. 2) A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation, pulling from the hidden states of all previous tokens. (Maybe sliding-window attention schemes muddy this distinction, but in general the degree of connectivity seems far higher.)
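A toy numpy sketch of those two differences (my own illustration, with made-up weights): the convolution weight for a neighbor depends only on its offset, while the attention weight is computed from the content and spans every position.

    import numpy as np

    T, d = 6, 4
    x = np.random.randn(T, d)

    # (1) Convolution: the weight on neighbor j depends only on the offset
    #     j - i, never on the content at j.
    w = {-1: 0.2, 0: 0.5, 1: 0.3}
    conv_out = np.array([
        sum(w[o] * x[i + o] for o in w if 0 <= i + o < T) for i in range(T)
    ])

    # (2) Attention: the weight on token j is softmax(q_i . k_j), i.e. it is
    #     computed from the content, and row i spans every position j.
    W_q, W_k = np.random.randn(d, d), np.random.randn(d, d)
    Q, K = x @ W_q, x @ W_k
    s = Q @ K.T / np.sqrt(d)
    A = np.exp(s - s.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)   # A[i, j]: content-dependent weight on token j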
There might be some unifying way to look at things though, maybe GNNs. I found this talk [1], and at 4:17 it shows how convolution and attention would be modeled in a GNN formalism.
[1] https://www.youtube.com/watch?v=J1YCdVogd14
>A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation.
In the same way that the learned weights used to generate the K, Q, V matrices may have zeros (or small values) for referencing certain tokens, convolution kernels just have those zeros defined up front.
Nested convolutions and dilated convolutions can both pull in data from further away.
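A small numpy sketch of the dilation point (my own toy example): the same 3-tap kernel covers a much wider stretch of the input when its taps are spaced apart.

    import numpy as np

    T, k = 16, 3
    x = np.random.randn(T)
    kernel = np.random.randn(k)

    def conv1d_dilated(x, kernel, dilation):
        # Each output pulls from taps spaced `dilation` apart, so a 3-tap
        # kernel with dilation 4 spans 9 input positions instead of 3.
        span = (len(kernel) - 1) * dilation
        return np.array([
            sum(kernel[j] * x[i + j * dilation] for j in range(len(kernel)))
            for i in range(len(x) - span)
        ])

    narrow = conv1d_dilated(x, kernel, dilation=1)   # receptive field of 3
    wide = conv1d_dilated(x, kernel, dilation=4)     # receptive field of 9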