Comment by mkw5053
2 months ago
I'm interested in how this would work for generative models. It's not obvious how you'd implement causal masking in the frequency domain. And the modReLU activation seems critical but adds implementation complexity. Would love to see how this scales on truly massive context lengths where the theoretical advantages should really shine.
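For reference, the modReLU mentioned above is usually defined as ReLU(|z| + b) * z / |z|: it thresholds the magnitude of a complex value and leaves its phase untouched. Below is a minimal NumPy sketch of that definition, not the implementation from the work under discussion. The function name `modrelu`, the scalar bias `b` (normally a learned per-channel parameter), and the `eps` guard against division by zero are all assumptions made for illustration.

```python
import numpy as np

def modrelu(z, b, eps=1e-8):
    """Sketch of modReLU: ReLU(|z| + b) * z / |z|.

    z   : complex array (e.g. frequency-domain activations)
    b   : real bias; assumed scalar here, typically learned per channel
    eps : hypothetical guard so z == 0 does not divide by zero
    """
    mag = np.abs(z)
    # Rescale the magnitude; multiplying by a non-negative real keeps the phase.
    scale = np.maximum(mag + b, 0.0) / (mag + eps)
    return scale * z

z = np.array([3 + 4j, 0.1 + 0.1j, 0j])
out = modrelu(z, b=-1.0)
# Entries whose magnitude is below -b are zeroed; larger ones shrink toward the origin.
```

The arithmetic itself is small; the extra complexity is more likely to come from carrying complex tensors and a learned bias through the rest of the training stack.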