Comment by jsenn

16 days ago

This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every preceding position, so adjacency in a later layer doesn’t necessarily correspond to adjacency between input tokens.
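To make that concrete, here's a toy single-head causal self-attention step (a minimal numpy sketch with made-up sizes, not any particular model's implementation). Each output row is literally a softmax-weighted average of the value vectors at positions up to and including that one:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8  # toy sequence length and model dimension
x = rng.standard_normal((T, d))

# Random projection matrices stand in for learned weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.T / np.sqrt(d)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # block attention to future positions
scores[mask] = -np.inf

# Softmax over each row: a probability distribution over positions 0..i.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ v  # row i is a weighted average of value vectors 0..i

print(np.allclose(weights.sum(axis=-1), 1.0))  # each row's weights sum to 1
print(np.count_nonzero(weights[3]))            # position 3 mixes positions 0..3
```

So even at layer 2, "position 3" already blends information from every earlier input token, weighted however attention decided.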

A borderline tautological answer might be: “because the network learns that putting related things next to each other makes the convolutions more useful.”