Comment by SilasX
2 years ago
>However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words is something that most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kind of things they're good/bad at.
At risk of showing my deficient understanding: that isn't actually true, is it?[1]
LLMs do much more than simply predict subsequent text, AFAICT. I think back to the earlier LLM results that wowed the world with stuff like "now I have a model where you can ask it 'king - man + woman' ... and it returns 'queen'. Trippy!"
That is pretty clearly not mere text prediction, at least not identically. Even if nearly all of the computational "hard work" comes from the math of predicting subsequent letters, that whole computation introduces a new primitive: the concept of combining, or doing math on, models to produce other models, and thereby making a statement about what words would come next in a variety of scenarios.
That primitive is not present in the Markov predictor example, at least not without positing how a Markov predictor could be similarly transformed -- which, being very ignorant on the matter, I'm not sure is possible or not, but either way the example leaves out a critical construct that enables ChatGPT to e.g. find limericks that aren't preceded by the command "Write me a limerick", as in my earlier comment[1].
[1] Earlier comment on why I think it's dubious to call LLM-based products such as ChatGPT "mere LMs": https://news.ycombinator.com/item?id=35472089
Predicting subsequent text is pretty much exactly what they do. Lots of very cool engineering that's a real feat, but at its core it's argmax(P(next token | preceding tokens, corpus)):
https://github.com/facebookresearch/llama/blob/main/llama/ge...
The engineering feats are up there with anything, but it’s a next token predictor.
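To make that concrete, here is a minimal sketch of the generation loop (for illustration only: it assumes the Hugging Face transformers API and GPT-2 weights rather than the llama code linked above): score the next token, take the argmax, append it, repeat.

    # Illustrative sketch of greedy next-token decoding with a small causal LM.
    # The model choice ("gpt2") and the transformers API are assumptions here,
    # not the llama code linked above; the shape of the loop is the point.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    for _ in range(10):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]        # scores for the next token only
        next_id = torch.argmax(logits)               # argmax P(token | preceding tokens)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    print(tok.decode(ids[0]))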
Did you read the part about king - man + woman = queen? How is that a next-token predictor?
With the caveat that I'm not an LLM expert here but have read up on some of this...
What's basically going on is that the LLM has "read" vast amounts of text and classified words in all kinds of dimensions. As it goes from word to word it's predicting based on all of those dimensions, rather than the Markov chain's simple probabilities. So it knows "king" has high values of masculinity and authority, for example, and "man" has high values of masculinity without necessarily the authority. Likewise for queen and woman. This also works for connective words like plus, minus, equals, etc. This leads the LLM to judge correctly that the most plausible word to continue that equation is "queen".
With enough dimensions and enough context (preceding words to include in the calculation) you get results that look like reasoning and intelligence -- and although a lot of people have started arguing that we need to define reasoning and intelligence in a way that excludes this, I'm not so sure of that. It seems quite possible that what goes on in our own heads is not so far from this with a few extra steps.
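A toy illustration of that idea (the vectors and "dimensions" below are invented for the example, not taken from any real model): treat each word as a point whose coordinates loosely mean "royalty" and "masculinity", and answer the analogy by nearest-neighbour lookup.

    # Toy example: hand-made 2-d "embeddings" (royalty, masculinity).
    # Real models learn hundreds or thousands of dimensions, but the
    # arithmetic being described is the same idea.
    import numpy as np

    vecs = {
        "king":  np.array([0.9, 0.9]),
        "queen": np.array([0.9, 0.1]),
        "man":   np.array([0.1, 0.9]),
        "woman": np.array([0.1, 0.1]),
    }

    target = vecs["king"] - vecs["man"] + vecs["woman"]

    def nearest(v, exclude=()):
        # pick the word whose vector is closest to v (excluding the query words,
        # as analogy lookups conventionally do)
        return min((w for w in vecs if w not in exclude),
                   key=lambda w: np.linalg.norm(vecs[w] - v))

    print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen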
It works by computing that P(queen|"king-man+woman", corpus) > P(<other-words>|"king-man+woman", corpus), i.e. it predicts that the most likely next token after that phrase, based on the entire training corpus and the loss function, is "queen".
Now, how exactly the LLM is computing this prediction remains poorly understood.
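A sketch of what that looks like in practice (assuming the Hugging Face transformers API and GPT-2 weights, purely for illustration; the exact ranking will vary by model): inspect the model's distribution over the single next token after the prompt.

    # Illustrative only: look at the next-token distribution after the analogy prompt.
    # "gpt2" and the transformers API are assumptions, not the model under discussion.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("king - man + woman =", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # scores for the next token
    probs = torch.softmax(logits, dim=-1)

    top = torch.topk(probs, k=5)
    for p, idx in zip(top.values, top.indices):
        print(f"{tok.decode(int(idx))!r}  p={p.item():.3f}")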
I was curious what GPT 4 would answer with that prompt:
input: king - man + woman =
"The equation you're referring to is often used as an example of how word embeddings work in natural language processing (NLP). Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. This allows for capturing semantic similarities between words based on their context.
The equation "king - man + woman" is a famous example that illustrates how word embeddings can capture analogies. In vector space, the operation is meant to find a word that is similar to "king" but more closely associated with "woman" than "man". The result of this operation is often "queen".
This result demonstrates the ability of certain word embedding models, like Word2Vec or GloVe, to capture relational similarities between words. It's a reflection of the model's capacity to understand that the relationship between "king" and "man" is similar to the relationship between "queen" and "woman", based on the learned vector representations of these words from large text corpora."
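For comparison, the classic analogy that answer describes can be reproduced on static word vectors; here is a sketch using gensim's pretrained GloVe vectors (the dataset name, and whether "queen" actually ranks first, are assumptions about the pretrained data):

    # Illustrative sketch: the analogy on static embeddings (word2vec/GloVe style).
    # The "glove-wiki-gigaword-50" dataset name in gensim's downloader is an assumption.
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-50")   # pretrained KeyedVectors
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # The word2vec/GloVe literature reports "queen" at or near the top of this list.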
word2vec is a very cool result, but suggestive at best for how a modern LLM works. in fact the king - man + woman = queen vector math you're referring to is pretty much a direct consequence of excluding nonlinearities from the SGD-driven matrix factorization process.
not everything done via SGD is a ReLU :)
Predicting the next token is how they are trained. But how they operate to achieve that goal is much, much more complicated than the principle of a Markov chain.
In theory, you are the byproduct of 'training' to survive and reproduce. Humans are very good at that task.
But the capabilities they developed to succeed at it extend far beyond the scope of the task alone.
> That primitive is not present in the Markov predictor example
It is. You need to go multi-dimensional; that's where "intelligence" emerges.
What you described is the vector-space math that falls out of encoding words into a high-dimensional (latent?) space.
SOTA LLMs use vector-space embeddings, so they get that coordinate-based intelligence "for free" as part of the prediction mechanism over the context.
Thanks, but can you clarify which confusion in my comment you're untangling there?
My understanding is that LLMs are basically approximations of Markov chains where the state (and hence the probability distribution) is conditioned on thousands of words of context. If you could directly compute and use that transition matrix, you'd get the same result. But that would be insane.
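For contrast, a toy sketch of the Markov-chain version (made-up corpus, bigram only): the transition table can be written down explicitly precisely because it conditions on one previous word instead of thousands.

    # Toy word-level bigram Markov predictor: an explicit transition table,
    # feasible only because the "state" is a single preceding word.
    from collections import Counter, defaultdict

    corpus = "the king wore a crown the queen wore a crown the man wore a hat".split()

    transitions = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev][nxt] += 1

    def predict_next(word):
        counts = transitions[word]
        total = sum(counts.values())
        # argmax over the empirical next-word distribution for this one-word state
        return max(counts, key=counts.get), {w: c / total for w, c in counts.items()}

    print(predict_next("wore"))   # ('a', {'a': 1.0})
    print(predict_next("the"))    # ties between 'king', 'queen', 'man'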
I'm pretty sure that is a property of the word2vec embeddings. I'm not 100% sure whether the embeddings/hidden states in LLMs have the same property, but my guess would be that they do.