Comment by jsenn
1 day ago
This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency isn’t necessarily related to adjacent input tokens.
A borderline tautological answer might be: “because the network learns that putting related things next to each other increases the usefulness of the convolutions.”
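The mixing described above can be sketched with a toy single-head self-attention in NumPy. This is an illustrative assumption, not code from any particular model: the shapes, random projections, and lack of masking are all placeholders, but it shows that each output position is a softmax-weighted average over all input positions, so "position i" after the first layer is no longer just input token i.

```python
import numpy as np

# Toy single-head self-attention (unmasked), illustrating that every
# output position is a weighted average over *all* input positions.
rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))               # layer input: 5 token positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
out = weights @ v                               # each output row mixes every position

# Every attention weight is strictly positive here, so each output
# position depends on every input position, not just its neighbours.
assert np.allclose(weights.sum(axis=-1), 1.0)
assert (weights > 0).all()
```

With a causal mask the average would run over previous positions only, but the point stands either way: adjacency in deeper layers need not correspond to adjacency in the input tokens.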