Comment by HarHarVeryFunny
5 days ago
Also important to note that in a Transformer-based LLM, embeddings are more than just a way of representing the input words. Embeddings are what pass through the transformer, layer by layer, and get transformed by it.
The size of the embedding space (number of vector dimensions) therefore needs to be larger than what would suffice just to represent word meanings - it also has to accommodate the information added by these layer-wise transformations.
The way I think of these transformations, but happy to be corrected, is more a matter of adding information rather than modifying what is already there. Conceptually, the embeddings start as word embeddings, then maybe get augmented with part-of-speech information, then additional syntactic/parsing information, then semantic information, the embedding being incrementally enriched as it is "transformed" by successive layers.
> The way I think of these transformations, but happy to be corrected, is more a matter of adding information rather than modifying
This is very much the case given the residual connections within the model. The final representation can be expressed as the initial embedding plus a sum of contributions from the N layers, where the N-th layer's contribution is a function of everything accumulated through layer N-1.
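A minimal sketch of that decomposition, using toy tanh "layers" as stand-ins for real attention/MLP blocks (the layer internals are made up for illustration; only the residual additions matter here):

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_layers = 16, 4

    # Toy stand-ins for transformer blocks: any function of the
    # current residual-stream state illustrates the point.
    weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

    def block(x, W):
        return np.tanh(x @ W)

    x = rng.normal(size=d_model)   # initial (word) embedding
    residual = x.copy()
    contributions = []

    for W in weights:
        out = block(residual, W)   # layer reads the accumulated state...
        contributions.append(out)
        residual = residual + out  # ...and adds its contribution to it

    # Final representation = initial embedding + sum of layer contributions,
    # where each contribution depends on everything added before it.
    print(np.allclose(residual, x + np.sum(contributions, axis=0)))  # True

Because each layer only adds to the stream, earlier information is never overwritten, only built upon - which matches the "incremental enrichment" picture above.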