Comment by SilasX
2 years ago
>However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words is something that most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kind of things they're good/bad at.
At risk of showing my deficient understanding: that isn't actually true, is it?[1]
LLMs do much more than simply predict subsequent text, AFAICT. I think back to the earlier LLM results that wowed the world with stuff like "now I have a model where you can ask it 'king - man + woman' ... and it returns 'queen'. Trippy!"
That is pretty clearly not mere text prediction, at least not identically. Even if nearly all of the computational "hard work" comes from the math of predicting subsequent letters, that whole computation introduces a new primitive: the concept of combining, or doing math on, models to produce other models, and thereby making a statement about what words would come next in a variety of scenarios.
That primitive is not present in the Markov predictor example, at least not without positing how a Markov predictor could be similarly transformed -- which, being very ignorant on the matter, I'm not sure is possible or not, but either way the example leaves out a critical construct that enables ChatGPT to e.g. find limericks that aren't preceded by the command "Write me a limerick", as in my earlier comment[1].
[1] Earlier comment on why I think it's dubious to call LLM-based products such as ChatGPT "mere LMs": https://news.ycombinator.com/item?id=35472089
Predicting subsequent text is pretty much exactly what they do. Lots of very cool engineering that's a real feat, but at its core it's argmax(P(next token | preceding tokens, corpus)):
https://github.com/facebookresearch/llama/blob/main/llama/ge...
The engineering feats are up there with anything, but it’s a next token predictor.
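To make that concrete, here is a minimal sketch of the generation loop (for illustration only: it assumes the Hugging Face transformers API and GPT-2 weights rather than the llama code linked above): score the next token, take the argmax, append it, repeat.

    # Illustrative sketch of greedy next-token decoding with a small causal LM.
    # The model choice ("gpt2") and the transformers API are assumptions here,
    # not the llama code linked above; the shape of the loop is the point.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    for _ in range(10):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]        # scores for the next token only
        next_id = torch.argmax(logits)               # argmax P(token | preceding tokens)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    print(tok.decode(ids[0]))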
Did you read the part about king - man + woman = queen? How is that a next-token predictor?
With the caveat that I'm not an LLM expert here but have read up on some of this...
What's basically going on is that the LLM has "read" vast amounts of text and classified words in all kinds of dimensions. As it goes from word to word it's predicting based on all of those dimensions, rather than the Markov chain's simple probabilities. So it knows "king" has high values of masculinity and authority, for example, and "man" has high values of masculinity without necessarily the authority. Likewise for queen and woman. This also works for connective words like plus, minus, equals, etc. This leads the LLM to judge correctly that the most plausible word to continue that equation is "queen".
With enough dimensions and enough context (preceding words to include in the calculation) you get results that look like reasoning and intelligence -- and although a lot of people have started arguing that we need to define reasoning and intelligence in a way that excludes this, I'm not so sure of that. It seems quite possible that what goes on in our own heads is not so far from this with a few extra steps.
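A toy illustration of that idea (the vectors and "dimensions" below are invented for the example, not taken from any real model): treat each word as a point whose coordinates loosely mean "royalty" and "masculinity", and answer the analogy by nearest-neighbour lookup.

    # Toy example: hand-made 2-d "embeddings" (royalty, masculinity).
    # Real models learn hundreds or thousands of dimensions, but the
    # arithmetic being described is the same idea.
    import numpy as np

    vecs = {
        "king":  np.array([0.9, 0.9]),
        "queen": np.array([0.9, 0.1]),
        "man":   np.array([0.1, 0.9]),
        "woman": np.array([0.1, 0.1]),
    }

    target = vecs["king"] - vecs["man"] + vecs["woman"]

    def nearest(v, exclude=()):
        # pick the word whose vector is closest to v (excluding the query words,
        # as analogy lookups conventionally do)
        return min((w for w in vecs if w not in exclude),
                   key=lambda w: np.linalg.norm(vecs[w] - v))

    print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen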
It works by computing that P(queen|"king-man+woman", corpus) > P(<other-words>|"king-man+woman", corpus), i.e. it predicts that the most likely next token after that phrase, based on the entire training corpus and the loss function, is "queen".
Now, how exactly the LLM is computing this prediction remains poorly understood.
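A sketch of what that looks like in practice (assuming the Hugging Face transformers API and GPT-2 weights, purely for illustration; the exact ranking will vary by model): inspect the model's distribution over the single next token after the prompt.

    # Illustrative only: look at the next-token distribution after the analogy prompt.
    # "gpt2" and the transformers API are assumptions, not the model under discussion.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("king - man + woman =", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # scores for the next token
    probs = torch.softmax(logits, dim=-1)

    top = torch.topk(probs, k=5)
    for p, idx in zip(top.values, top.indices):
        print(f"{tok.decode(int(idx))!r}  p={p.item():.3f}")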
I was curious what GPT 4 would answer with that prompt:
input: king - man + woman =
"The equation you're referring to is often used as an example of how word embeddings work in natural language processing (NLP). Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. This allows for capturing semantic similarities between words based on their context.
The equation "king - man + woman" is a famous example that illustrates how word embeddings can capture analogies. In vector space, the operation is meant to find a word that is similar to "king" but more closely associated with "woman" than "man". The result of this operation is often "queen".
This result demonstrates the ability of certain word embedding models, like Word2Vec or GloVe, to capture relational similarities between words. It's a reflection of the model's capacity to understand that the relationship between "king" and "man" is similar to the relationship between "queen" and "woman", based on the learned vector representations of these words from large text corpora."
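For comparison, the classic analogy that answer describes can be reproduced on static word vectors; here is a sketch using gensim's pretrained GloVe vectors (the dataset name, and whether "queen" actually ranks first, are assumptions about the pretrained data):

    # Illustrative sketch: the analogy on static embeddings (word2vec/GloVe style).
    # The "glove-wiki-gigaword-50" dataset name in gensim's downloader is an assumption.
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-50")   # pretrained KeyedVectors
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # The word2vec/GloVe literature reports "queen" at or near the top of this list.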
word2vec is a very cool result, but suggestive at best for how a modern LLM works. in fact the king - man + woman = queen vector math you're referring to is pretty much a direct consequence of excluding nonlinearities from the SGD-driven matrix factorization process.
not everything done via SGD is a ReLU :)
Predicting the next token is how they are trained. But how they operate to achieve that goal is much, much more complicated than the principle of a Markov chain.
In theory, you are the byproduct of 'training' to survive and reproduce. Humans are very good at that task.
But the capabilities they developed to succeed at it extend far beyond the scope of the task alone.
> That primitive is not present in the Markov predictor example
It is. You need to go multi-dimensional; that's where "intelligence" emerges.
What you described is the vector-space math that falls out of encoding words into a high-dimensional (latent?) space.
SOTA LLMs use vector-space embeddings, so they get that coordinate-based intelligence "for free" as part of the prediction mechanism over the context.
Thanks, but can you clarify which confusion in my comment you're untangling there?
My understanding is that LLMs are basically approximations of Markov chains where the state (and hence the probability distribution) is conditioned on thousands of words of context. If you could directly compute and use that transition matrix, you'd get the same result. But that would be insane.
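For contrast, a toy sketch of the Markov-chain version (made-up corpus, bigram only): the transition table can be written down explicitly precisely because it conditions on one previous word instead of thousands.

    # Toy word-level bigram Markov predictor: an explicit transition table,
    # feasible only because the "state" is a single preceding word.
    from collections import Counter, defaultdict

    corpus = "the king wore a crown the queen wore a crown the man wore a hat".split()

    transitions = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev][nxt] += 1

    def predict_next(word):
        counts = transitions[word]
        total = sum(counts.values())
        # argmax over the empirical next-word distribution for this one-word state
        return max(counts, key=counts.get), {w: c / total for w, c in counts.items()}

    print(predict_next("wore"))   # ('a', {'a': 1.0})
    print(predict_next("the"))    # ties between 'king', 'queen', 'man'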
I'm pretty sure that is a property of the word2vec embeddings. I'm not 100% sure whether the embeddings/hidden states in LLMs have the same property, but my guess would be that they do.