Comment by fheinsen
18 days ago
Unlike previous efforts, which typically stop at a low-order (e.g., quadratic) term of the Taylor expansion, this work derives a succinct, efficient, and parallelizable general method for approximating attention with any number of Taylor terms, to arbitrary precision.
The GitHub repository's first toy example uses 8 Taylor terms, applied to a context of 1B tokens, with attention computed over 1K heads per token. (Note that applying the quadratic formulation to 1B tokens, each with 1K heads, is not practical on current hardware, because it would require computing 1K attention matrices, each with 1B×1B dot-product scores.)
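For intuition only, here is a minimal NumPy sketch of the underlying idea: replace exp(s) in softmax attention with its truncated Taylor series, sum of s^n/n! for n up to p. This reference version still costs O(T²) and is not the paper's parallel, linear-cost algorithm; the function names and shapes are my own illustration, not the repository's API. The point is that with enough terms the truncation matches exact softmax attention closely, and the polynomial form is what admits the efficient factorization.

```python
import math
import numpy as np

def taylor_attention(Q, K, V, p=8):
    # Softmax attention with exp(s) replaced by its order-p Taylor truncation.
    # Quadratic-cost reference for checking accuracy, not the efficient method.
    S = Q @ K.T                                            # (T, T) score matrix
    W = sum(S**n / math.factorial(n) for n in range(p + 1))
    W = W / W.sum(axis=-1, keepdims=True)                  # normalize rows
    return W @ V

def exact_attention(Q, K, V):
    # Standard softmax attention (numerically stabilized) for comparison.
    S = Q @ K.T
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    W = W / W.sum(axis=-1, keepdims=True)
    return W @ V
```

With small dot-product scores (say |s| ≲ 1), the order-8 truncation error of exp is on the order of s⁹/9!, so the two functions agree to several decimal places on a toy input.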
Like every other proposed method, this one must be tested too. If it works, AI service providers who ignore it will find themselves at a disadvantage.
It's also worth mentioning that the mathematical techniques introduced by this work are likely of interest for applications beyond attention.