Comment by ethan_smith
4 days ago
That performance is maintained (or even improved) isn't surprising: sparse attention can reduce noise by focusing only on the relevant tokens. Full attention dilutes focus by attending to everything equally, while NSA's pruning approach mimics how humans selectively process information.
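To make the "pruning" intuition concrete, here's a minimal top-k sparse attention sketch for a single query. This is not NSA's actual algorithm (NSA combines compressed, selected, and sliding-window branches); it's just the simplest illustration of attending to a subset of tokens instead of all of them. All names here are my own.

```python
import numpy as np

def full_attention(q, K, V):
    """Standard softmax attention: the query attends to every token."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def topk_sparse_attention(q, K, V, k):
    """Toy sparse attention: keep only the k highest-scoring tokens,
    softmax over those, and ignore the rest entirely."""
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argsort(scores)[-k:]          # indices of the k most relevant tokens
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ V[idx]                      # weighted sum over the kept tokens only
```

With k equal to the sequence length this reduces exactly to full attention; with small k, low-scoring (likely noisy) tokens contribute nothing, which is the noise-reduction argument above.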