Comment by noosphr
4 hours ago
Yes, and it works in theory.
Less so in practice. You saturate the memory of a B200 with a few dozen tokens once the attention order goes above 4. Training is even worse.
To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.
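
For a rough feel of the memory claim, here is a back-of-the-envelope sketch. It assumes that "order k" attention naively materializes a score tensor with n^k entries for n tokens (so order 2 is ordinary pairwise attention), that entries are 2 bytes each, and that the budget is about 192 GB for one B200. All of these are assumptions for illustration, not figures from the comment, and the helper names are made up.

```python
# Rough estimate only: assumes a naive order-k score tensor with n**k
# entries per head and ignores activations, weights, and gradients.

def score_tensor_bytes(n_tokens: int, order: int,
                       bytes_per_elem: int = 2, heads: int = 1) -> int:
    """Bytes needed to hold one naive order-k score tensor per head."""
    return heads * bytes_per_elem * n_tokens ** order


def max_tokens(order: int, budget_bytes: int,
               bytes_per_elem: int = 2, heads: int = 1) -> int:
    """Largest token count whose score tensor still fits in the budget."""
    n = 0
    while score_tensor_bytes(n + 1, order, bytes_per_elem, heads) <= budget_bytes:
        n += 1
    return n


# Assumed memory budget: roughly 192 GB for a single B200.
BUDGET = 192 * 10**9

if __name__ == "__main__":
    for order in (2, 3, 4, 5, 6, 8):
        print(f"order {order}: at most {max_tokens(order, BUDGET)} tokens")
```

This counts a single head in a single layer with no batch dimension; adding heads, layers, or the activations kept around for backprop shrinks the token budget further, which fits the point about training being worse.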