Comment by libraryofbabel

2 days ago

Thanks! Good to know I’m not missing something here. And yeah, it’s always seemed better to me to frame it as: let’s find a mathematical structure that relates every embedding vector in a sequence to every other vector, and then throw in a bunch of linear projections so there are plenty of parameters to learn during training, letting that relationship structure model whatever shows up in language, concepts, code, and so on.
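
To make that framing concrete, here’s a minimal numpy sketch of single-head scaled dot-product attention, written just to emphasize the "relate every vector to every other vector, through learned projections" view (the sizes and variable names are arbitrary, and the projection weights are random here where they would normally be learned):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# toy "sequence": n embedding vectors of dimension d
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))

# the linear projections (random for illustration; learned during training)
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# relate every vector to every other vector: an n x n score matrix
scores = Q @ K.T / np.sqrt(d)
weights = softmax(scores, axis=-1)  # each row is a distribution over positions

# each output vector is a weighted mix of all the (projected) input vectors
out = weights @ V                   # shape (n, d)
print(out.shape)
```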

I’ll have to read up on merged attention; I haven’t got that far yet!

The main takeaway is that "attention" is a much broader concept in general, so worrying too much about the "scaled dot-product attention" of transformers really limits your understanding of what kinds of things actually matter.

A paper I found particularly useful on this generalizes even further, pointing to the importance of multiplicative interactions in deep learning more broadly (https://openreview.net/pdf?id=rylnK6VtDH).
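
If it’s useful, here’s roughly what I took away from it as a tiny numpy sketch (the shapes and variable names are mine, just for illustration): a general multiplicative interaction between two inputs uses a 3D weight tensor, and familiar things like elementwise gating fall out as special cases.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dz, dy = 6, 4, 5            # illustrative sizes, not from the paper

x = rng.normal(size=dx)
z = rng.normal(size=dz)

# general multiplicative interaction, roughly f(x, z) = z^T W x + U z + V x + b,
# where W is a 3D tensor, so the output depends on products of x- and z-entries
W = rng.normal(size=(dy, dz, dx))
U = rng.normal(size=(dy, dz))
V = rng.normal(size=(dy, dx))
b = rng.normal(size=dy)

f = np.einsum('ijk,j,k->i', W, z, x) + U @ z + V @ x + b
print(f.shape)  # (dy,)

# special case: a "diagonal" W (with U, V, b zero) reduces to elementwise
# gating z * x, the kind of multiplicative interaction in LSTM gates or GLUs
d = 5
xg, zg = rng.normal(size=d), rng.normal(size=d)
W_diag = np.zeros((d, d, d))
for i in range(d):
    W_diag[i, i, i] = 1.0
assert np.allclose(np.einsum('ijk,j,k->i', W_diag, zg, xg), zg * xg)
```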

EDIT: Also, here’s the paper I was looking for; it dramatically generalizes the notion of attention in a way I found quite helpful: https://arxiv.org/pdf/2111.07624