Comment by oofbey
11 hours ago
Depending on how different the attention mechanism is, that might not work. If it’s just a faster / different way of finding the tokens to attend to, sure. But I get the sense the author is implying this method uses different semantics somehow. Although tbh I didn’t follow it entirely.