
Comment by CephalopodMD

2 days ago

I think of it more from an information retrieval (i.e. search) perspective.

Imagine the input text as though it were the whole internet, where each page is just one token. Your job is to build a neural-network Google results page for that mini internet of tokens.

In traditional search, we are given a search query, and we want to find web pages via an intermediate search results page with 10 blue links. Basically, when we're Googling something, we want to know "What web pages are relevant to this given search query?", and then given those links we ask "what do those web pages actually say?" and click on the links to answer our question. In this case, the "Query" is obviously the user search query, the "Key" is one of the ten blue links (usually the title of the page), and the "Value" is the content of the web page that link goes to.

In the attention mechanism, we are given a token and we want to find its meaning when contextualized with other tokens. Basically, we are first trying to answer the question "which other tokens are relevant to this token?", and then, given the answer to that, we ask "what is the meaning of the original token given these other relevant tokens?" The "Query" is a given token in the input text, the "Key" is another token in the input text, and the "Value" is the final meaning of the original token with that other token in context (in the form of an embedding). For a given token, you can imagine that the attention mechanism "clicked the 10 blue links" of the other most relevant tokens in the input and combined them in some way to figure out the meaning of the original query token (and you might imagine such a query running in parallel for every token in the input text).
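To make this concrete, here is a minimal single-head scaled dot-product self-attention sketch in NumPy. The function and variable names (self_attention, W_q, and so on) are mine for illustration, not from any particular library, and real implementations add multiple heads, masking, batching, and so on.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) embeddings of the input tokens.
    W_q, W_k, W_v: learned projection matrices of shape (d_model, d_k)."""
    Q = X @ W_q            # each token's "search query"
    K = X @ W_k            # each token's "blue link" / key
    V = X @ W_v            # each token's "page content" / value
    d_k = K.shape[-1]
    # "Which other tokens are relevant to this token?"
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (seq_len, seq_len)
    # "Combine the relevant tokens' values into a contextualized meaning."
    return weights @ V     # (seq_len, d_k)

# Toy usage: 5 tokens, 8-dim embeddings, 4-dim head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)   # one "results page" per token
```

Note that every token issues its own query against every other token's key, which is the "run all the searches in parallel" part of the analogy.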

So the self-attention mechanism is basically Google search, except that instead of a user query it's a token in the input, instead of a blue link it's another token, and instead of a web page it's a meaning.

Read through my comments and those of others in this thread: the way you are thinking here is metaphorical, and so disconnected from the actual math as to be unhelpful. It is not the case that you can gain a meaningful understanding of deep networks by metaphor. You actually need to learn some very basic linear algebra.
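For reference, that linear algebra really is basic. For an input embedding matrix X, one head of standard scaled dot-product attention is just a few matrix products and a softmax:

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```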

Heck, attention layers never even see tokens. Even the first self-attention layer only sees (position-augmented) embeddings, and all subsequent attention layers just see complicated embeddings that are a mish-mash of the previous layers' outputs.
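To illustrate what the first layer actually receives (with made-up names and sizes, not any specific model's): token IDs are turned into embedding vectors and position information is added before any attention happens.

```python
import numpy as np

vocab_size, d_model, max_len = 50_000, 8, 512
rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))   # learned lookup table
position_embedding = rng.normal(size=(max_len, d_model))   # learned (or sinusoidal) positions

token_ids = np.array([17, 4032, 9, 9, 250])                # the actual "tokens"
X = token_embedding[token_ids] + position_embedding[:len(token_ids)]
# X, shape (seq_len, d_model), is what the first self-attention layer sees;
# deeper layers see the previous layer's output embeddings instead.
```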