
Comment by D-Machine

2 days ago

Read through my comments and those of others in this thread: the way you are thinking here is metaphorical, and so disconnected from the actual math as to be unhelpful. It is not the case that you can gain a meaningful understanding of deep networks by metaphor. You actually need to learn some very basic linear algebra.

Heck, attention layers never even see tokens. Even the first self-attention layer only sees token embeddings mixed with positional embeddings, and all subsequent attention layers are just seeing complicated embeddings that are a mish-mash of the previous layers' outputs.
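
To make that concrete, here is a minimal sketch (PyTorch, with made-up dimensions and a learned positional embedding as one common choice) of what attention actually receives: floating-point vectors built from the embedding tables, never the token IDs themselves, and each later layer only sees the residual mix produced by earlier layers.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration.
vocab_size, seq_len, d_model, n_heads = 50_000, 16, 512, 8

token_ids = torch.randint(0, vocab_size, (1, seq_len))  # integer token IDs

tok_emb = nn.Embedding(vocab_size, d_model)  # token embedding table
pos_emb = nn.Embedding(seq_len, d_model)     # learned positional embeddings

positions = torch.arange(seq_len).unsqueeze(0)
x = tok_emb(token_ids) + pos_emb(positions)  # shape (1, seq_len, d_model)

# The first self-attention layer sees x: real-valued vectors, not tokens.
attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
out1, _ = attn1(x, x, x)

# A second layer only sees the mix produced by the layer before it.
x = x + out1  # residual connection
attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
out2, _ = attn2(x, x, x)  # embeddings of embeddings, all the way down
```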