Comment by Al-Khwarizmi
6 days ago
With bag of words, each word is represented by a one-hot vector: its dimension is the vocabulary (dictionary) size, and all components are zero except the one corresponding to that word, which is one.
This is not well suited for training neural networks (they like to be fed dense, continuous data, not sparse, discrete data), and it treats each word as an atomic entity without capturing any relationships between words (you have no way to know that the words "plane" and "airplane" are more related than "plane" and "dog").
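To make that concrete, here is a minimal NumPy sketch (toy 4-word vocabulary, purely illustrative): with one-hot vectors, any two distinct words are orthogonal, so there is no notion of relatedness to read off the vectors.

```python
# One-hot / bag-of-words view: every pair of distinct words is equally unrelated.
import numpy as np

vocab = ["plane", "airplane", "dog", "cat"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# The dot product (and cosine similarity) between any two different words is 0,
# so "plane" is no closer to "airplane" than it is to "dog".
print(one_hot["plane"] @ one_hot["airplane"])  # 0.0
print(one_hot["plane"] @ one_hot["dog"])       # 0.0
```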
With word embeddings, you instead get a space of continuous vectors with a predefined (lower) number of dimensions. This is much more useful as input or training data for neural networks, and it is a representation of the meaning space ("plane" and "airplane" will have very similar vectors, while the one for "dog" will be quite different), which opens up a lot of possibilities for making models and systems more robust.
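A matching sketch with hand-picked 3-dimensional vectors (toy numbers chosen for illustration, not trained embeddings): in a dense space, cosine similarity can reflect relatedness.

```python
# Dense embeddings: relatedness shows up as high cosine similarity.
import numpy as np

emb = {
    "plane":    np.array([0.90, 0.10, 0.00]),
    "airplane": np.array([0.85, 0.15, 0.05]),
    "dog":      np.array([0.05, 0.90, 0.40]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["plane"], emb["airplane"]))  # ~0.99: very similar
print(cosine(emb["plane"], emb["dog"]))       # ~0.15: much less related
```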
It is also important to note that in a Transformer-based LLM, embeddings are more than just a way of representing the input words. Embeddings are what passes through the transformer, layer by layer, and gets transformed by it.
The size of the embedding space (the number of vector dimensions) is therefore larger than what would be needed just to represent word meanings: it has to be large enough to also hold the information added by these layer-wise transformations.
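For a rough sense of scale (figures quoted from memory of the published configs, so treat them as approximate): static word embeddings typically use a few hundred dimensions, while LLM hidden states are an order of magnitude wider.

```python
# Typical embedding / hidden-state widths (approximate, from the published configs).
embedding_dims = {
    "word2vec (Google News vectors)": 300,
    "GloVe (largest common release)": 300,
    "GPT-2 small (d_model)":          768,
    "GPT-3 175B (d_model)":           12288,
}
```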
The way I think of these transformations, but happy to be corrected, is more a matter of adding information rather than modifying what is already there. Conceptually, the embeddings start out as word embeddings, then perhaps get augmented with part-of-speech information, then with additional syntactic/parsing information, then with semantic information, the embedding being incrementally enriched as it is "transformed" by successive layers.
> The way I think of these transformations, but happy to be corrected, is more a matter of adding information rather than modifying
This is very much the case, given the residual connections within the model. The final representation can be expressed as a sum of contributions from the N layers (plus the initial embedding), where the N-th layer's contribution is a function of the (N-1)-th representation.
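Concretely: with a residual connection each layer computes x_n = x_{n-1} + f_n(x_{n-1}), so unrolling over N layers gives x_N = x_0 + f_1(x_0) + f_2(x_1) + ... + f_N(x_{N-1}). A toy sketch (stand-in functions, not a real transformer block) checking this numerically:

```python
# Sketch of why residual connections make the final representation a sum of
# per-layer contributions. The "layers" here are toy functions, not real
# attention/MLP blocks.
import numpy as np

rng = np.random.default_rng(0)
d = 8
x0 = rng.standard_normal(d)                    # initial word embedding

# Stand-ins for the per-layer transformations f_n (fixed random weights).
layer_fns = [lambda h, W=rng.standard_normal((d, d)) / d: np.tanh(h @ W)
             for _ in range(4)]

# Standard forward pass with residual connections: x_n = x_{n-1} + f_n(x_{n-1})
x = x0
contributions = []
for f in layer_fns:
    delta = f(x)              # what this layer "writes" into the representation
    contributions.append(delta)
    x = x + delta

# Unrolled view: final state = original embedding + sum of what each layer added
assert np.allclose(x, x0 + np.sum(contributions, axis=0))
```

This residual-stream view is also why thinking of each layer as adding information, rather than overwriting it, is a natural mental model.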