
Comment by jsenn

7 hours ago

You can find papers discussing "cubic" attention, i.e. attention in which each token gets to interact with every pair of other tokens, but always in very theoretical settings: single-layer transformers on contrived synthetic tasks.
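For concreteness, here's a minimal NumPy sketch of what a third-order score like that could look like: each query token is scored against every *pair* of key tokens, so the score tensor alone has n³ entries. The trilinear scoring form and the way pair values get combined here are my own assumptions for illustration, not taken from any specific paper.

```python
import numpy as np

def third_order_attention(x, Wq, Wk1, Wk2, Wv):
    """Illustrative 'cubic' attention: every token attends to every pair of tokens."""
    n, d = x.shape
    q = x @ Wq      # (n, d) queries
    k1 = x @ Wk1    # (n, d) first keys
    k2 = x @ Wk2    # (n, d) second keys
    v = x @ Wv      # (n, d) values

    # Trilinear score for every (token, pair-of-tokens) combination: an (n, n, n) tensor.
    scores = np.einsum('id,jd,kd->ijk', q, k1, k2) / np.sqrt(d)

    # Softmax over all n^2 key pairs for each query token.
    weights = np.exp(scores - scores.max(axis=(1, 2), keepdims=True))
    weights /= weights.sum(axis=(1, 2), keepdims=True)

    # Combine each pair's values (elementwise product, an arbitrary choice) and aggregate.
    pair_values = np.einsum('jd,kd->jkd', v, v)           # (n, n, d)
    return np.einsum('ijk,jkd->id', weights, pair_values) # (n, d)
```

Even at modest sequence lengths the n³ score tensor is what makes this impractical outside toy settings.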

Keep in mind that LLMs have many, many layers, so they have plenty of opportunity to model higher-order interactions without needing to brute-force every possible combination of 10 previous tokens, the vast majority of which would be useless. Empirically, even full "quadratic" attention is not always necessary, as evidenced by linear/sparse attention variants that perform almost as well (a rough sketch of the contrast below).
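To make the quadratic-vs-sparse point concrete, here's a hedged sketch contrasting full attention with a causal sliding-window variant, which is just one of many possible sparse schemes; the function names and window size are illustrative. Note that for clarity the sparse version still materialises the full score matrix and masks it; a real linear-cost implementation would compute only the in-window entries.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard "quadratic" attention: every token attends to every token -> (n, n) scores.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sliding_window_attention(q, k, v, window=4):
    # Sparse variant: each token attends only to its `window` most recent tokens,
    # so the number of nonzero scores grows linearly with sequence length.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    causal = idx[None, :] <= idx[:, None]
    in_window = idx[:, None] - idx[None, :] < window
    scores = np.where(causal & in_window, scores, -np.inf)
    return softmax(scores) @ v
```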