Comment by nbardy
10 hours ago
This has been my thought for a long time. I think all that matters from attention is that there is crosswise comparison going on.
You need some amount of parallel compute and some amount of global comparison.
And the rest is basically a ways to parameters and scale.
(This is in theory, in practice you can get a lot of small % stability and efficiency improvements that really compound in algorithmic details of model architecture)
No comments yet
Contribute on Hacker News ↗