Comment by ActorNightly
12 hours ago
>A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation.
In the same way that the learned weights that generate the K, Q, V matrices may end up with zeros (or small values) for certain tokens, convolution kernels just have those zeros defined up front.
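A minimal sketch of the analogy, using plain NumPy with a toy 1D kernel and random projections standing in for learned Q/K weights: a convolution can be written as multiplication by a banded matrix whose off-band zeros are fixed by construction, whereas attention produces a dense weight matrix in which any near-zero entries have to be learned.

```python
import numpy as np

seq_len = 8
kernel = np.array([0.25, 0.5, 0.25])  # toy 1D convolution kernel
x = np.random.randn(seq_len)          # toy input sequence

# Convolution expressed as a matrix: entries outside the local window
# are zeros by definition, not by learning.
conv_as_matrix = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for offset, w in enumerate(kernel, start=-(len(kernel) // 2)):
        j = i + offset
        if 0 <= j < seq_len:
            conv_as_matrix[i, j] = w

# Self-attention: weights come from learned projections (random here as a
# stand-in); locality, if it emerges, must be learned as small/zero scores.
d = 4
Wq, Wk = np.random.randn(1, d), np.random.randn(1, d)   # toy Q/K projections
q, k = x[:, None] @ Wq, x[:, None] @ Wk
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row softmax

print("convolution as a weight matrix (structural zeros):\n", conv_as_matrix.round(2))
print("attention weight matrix (dense, no built-in zeros):\n", attn.round(2))
print("conv output:", (conv_as_matrix @ x).round(3))
```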