Comment by vessenes

3 hours ago

Yes. That is, it is if you imagine a magically good self attention mechanism that could decide what in the context to attend to at any one moment. Then it would be like working with a polymath that has incredible memory. Or bringing in that aged but still senior Chief of Staff of a large company that knows where every body was buried, and why every decision was made at the time it was made, or a professor of film that has seen and can remember thousands of films.

Shockingly, we seem to have found a self attention mechanism of that quality, it just has the sad property of growing at O(N^2) where N is the context length.