Comment by fheinsen
17 days ago
The GitHub repository's first toy example uses 8 Taylor terms, applied to a context of 1B tokens with 1K attention heads:
https://github.com/glassroom/sata_attention
That toy example is not practical with the quadratic formulation, because it would require computing and storing 1K attention matrices, each containing 1B×1B dot-product scores. At Float32 precision, those matrices would consume approximately 1K × 1B × 1B × 4 bytes = 4×10^21 bytes, i.e. on the order of four billion terabytes of memory, which is not practical.
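A quick sketch of that arithmetic, using the figures from the example above (the variable names are illustrative, not from the repository):

```python
# Back-of-the-envelope memory cost of materializing quadratic attention
# at the toy example's scale: 1K heads, 1B-token context, Float32 scores.
n_heads = 1_000            # attention heads
n_tokens = 1_000_000_000   # context length
bytes_per_score = 4        # Float32

# One n_tokens x n_tokens score matrix per head.
total_bytes = n_heads * n_tokens**2 * bytes_per_score
terabytes = total_bytes / 1e12  # decimal terabytes

print(f"{total_bytes:.3e} bytes ≈ {terabytes:,.0f} TB")
# → 4.000e+21 bytes ≈ 4,000,000,000 TB
```

Exact decimal vs. binary unit conventions shift the headline number slightly, but not the conclusion: the quadratic formulation is many orders of magnitude beyond any feasible memory budget at this scale.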
Like every other proposed method, this one must be tested too. If it performs well in practice, AI service providers who ignore it will find themselves at a disadvantage.
Either way, the mathematical techniques introduced by this work are likely to prove useful in applications beyond Transformer attention.