Comment by D-Machine
2 days ago
The main takeaway is that "attention" is a much broader concept than its transformer incarnation; fixating on the "scaled dot-product attention" of transformers limits your understanding of what kinds of mechanisms really matter in general.
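A minimal sketch of what I mean (my own, in PyTorch): the dot product is just one choice of similarity function, and the familiar transformer layer drops out as a special case.

    import torch
    import torch.nn.functional as F

    def general_attention(q, k, v, similarity):
        # similarity(q, k) -> (batch, seq_q, seq_k); any score function works
        return F.softmax(similarity(q, k), dim=-1) @ v

    # scaled dot-product attention recovered as one instance
    scaled_dot = lambda q, k: q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    q = k = v = torch.randn(2, 5, 8)
    out = general_attention(q, k, v, scaled_dot)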
A paper I found particularly useful on this generalizes even further, noting the importance of multiplicative interactions more broadly in deep learning (https://openreview.net/pdf?id=rylnK6VtDH).
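If I remember the paper right, its general form is roughly f(x, z) = z^T W x + U z + V x + b with W a rank-3 tensor, and things like elementwise gating and hypernetworks fall out as diagonal or low-rank special cases. A rough sketch (mine, details from memory):

    import torch

    class MultiplicativeInteraction(torch.nn.Module):
        # roughly f(x, z) = z^T W x + U z + V x + b, with W a 3D tensor
        def __init__(self, dx, dz, dout):
            super().__init__()
            self.W = torch.nn.Parameter(torch.randn(dz, dout, dx) * 0.01)
            self.U = torch.nn.Linear(dz, dout, bias=False)
            self.V = torch.nn.Linear(dx, dout, bias=True)

        def forward(self, x, z):
            # bilinear term contracts z against W against x
            bilinear = torch.einsum('bz,zox,bx->bo', z, self.W, x)
            return bilinear + self.U(z) + self.V(x)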
EDIT: Also, this paper I was looking for dramatically generalizes the notion of attention in a way I found quite helpful: https://arxiv.org/pdf/2111.07624
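The framing there, as I read it, is that attention in general is just f(g(x), x): some function g computes data-dependent weights from the input, and f applies them back to it. A toy instance (my own example, not from the paper), channel attention over a (batch, channels, length) tensor:

    import torch

    def generalized_attention(x, g, f):
        # g produces data-dependent weights; f applies them to the input
        return f(g(x), x)

    # squeeze-and-excite-style channel weighting as one instance
    channel_weights = lambda x: torch.sigmoid(x.mean(dim=-1, keepdim=True))
    out = generalized_attention(torch.randn(2, 8, 16), channel_weights, lambda a, x: a * x)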