← Back to context

Comment by empiricus

1 day ago

Not sure how helpful it is, but: Words or concepts are represented as high-dim vectors. At high level, we could say each dimension is another concept like "dog"-ness or "complexity" or "color"-ness. The "a word looks up to how relevant it is to another word" is basically just relevance=distance=vector dot product. and the dot product can be distorted="some directions are more important" for one purpose or another(q/k/v matrixes distort the dot product). softmax is just a form of normalization (all sums to 1 = proper probability). The whole shebang works only because all pieces can be learned by gradient descent, otherwise it would be impossible to implement.