Comment by ethan_smith
6 days ago
Attention weights can still assign non-zero probability to irrelevant tokens, since the mechanism is optimized for prediction rather than semantic relevance, and those irrelevant tokens can create interference in the hidden-state representations.
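A minimal numpy sketch of the point (the scores, value vectors, and the relevant/irrelevant split are invented for illustration): softmax produces strictly positive weights, so even tokens with very low scores contribute a small fraction of their value vectors to the attention output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical attention scores for one query over 4 tokens:
# tokens 0 and 1 are "relevant" (high score), tokens 2 and 3 are not.
scores = np.array([4.0, 3.5, -2.0, -3.0])
weights = softmax(scores)
print(weights)
# roughly [0.621, 0.377, 0.0015, 0.0006] -- the irrelevant tokens still get
# strictly positive weight, because softmax never outputs exact zeros.

# The attention output is a weighted sum of value vectors, so those small
# but non-zero weights mix part of the irrelevant tokens' values into the
# hidden state.
V = np.array([
    [1.0, 0.0],   # value for token 0 (relevant)
    [0.9, 0.1],   # value for token 1 (relevant)
    [0.0, 5.0],   # value for token 2 (irrelevant, large magnitude)
    [0.0, 5.0],   # value for token 3 (irrelevant, large magnitude)
])
output = weights @ V
print(output)  # second coordinate is pulled away from ~0.04 by the irrelevant values
```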