Comment by benreesman

2 years ago

Predicting subsequent text is pretty much exactly what they do. Lots of very cool engineering that’s a real feat, but at its core it’s argmax P(next token | previous tokens, corpus):

https://github.com/facebookresearch/llama/blob/main/llama/ge...

The engineering feats are up there with anything, but it’s a next-token predictor.
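
To make "next-token predictor" concrete, here is a minimal greedy-decoding sketch. It is not the LLaMA code linked above; it assumes the Hugging Face transformers library and uses GPT-2 purely as a stand-in model:

    # Greedy decoding: repeatedly take argmax P(token | previous tokens).
    # Illustrative sketch only (GPT-2 via Hugging Face transformers).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids
    for _ in range(5):
        with torch.no_grad():
            logits = model(tokens).logits      # shape (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()       # most probable next token
        tokens = torch.cat([tokens, next_id.view(1, 1)], dim=-1)

    print(tokenizer.decode(tokens[0]))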

Did you read the part about king - man + woman = queen? How is that a next-token predictor?

  • With the caveat that I'm not an LLM expert here but have read up on some of this...

    What's basically going on is that the LLM has "read" vast amounts of text and classified words along all kinds of dimensions. As it goes from word to word it's predicting based on all of those dimensions, rather than the Markov chain's simple probabilities. So it knows "king" has high values of masculinity and authority, for example, and "man" has high values of masculinity without necessarily the authority. Likewise for queen and woman. This also works for connective words like plus, minus, equals, etc. This leads the LLM to judge correctly that the most plausible word to continue that equation is "queen" (a toy sketch of this arithmetic follows below).

    With enough dimensions and enough context (preceding words to include in the calculation) you get results that look like reasoning and intelligence -- and although a lot of people have started arguing that we need to define reasoning and intelligence in a way that excludes this, I'm not so sure of that. It seems quite possible that what goes on in our own heads is not so far from this with a few extra steps.
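
    Here is that toy sketch: hand-made three-dimensional vectors with axes roughly (masculinity, authority, personhood). Real embeddings learn hundreds of unlabeled dimensions from data rather than having them designed by hand, but the arithmetic is the same:

      # Toy hand-made vectors; real word embeddings are learned, not designed.
      import numpy as np

      vecs = {
          "king":  np.array([0.9, 0.9, 0.7]),
          "queen": np.array([0.1, 0.9, 0.7]),
          "man":   np.array([0.9, 0.1, 0.9]),
          "woman": np.array([0.1, 0.1, 0.9]),
      }

      target = vecs["king"] - vecs["man"] + vecs["woman"]

      def cosine(a, b):
          return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

      # The nearest stored vector to king - man + woman turns out to be queen.
      print(max(vecs, key=lambda w: cosine(vecs[w], target)))  # -> queen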

    • So you're saying that the word2vec king/queen result came from feeding in the verbatim text "king - man + woman", and that text being continued as "queen"? I assumed it was more the result of doing math on the properties the model generated for the tokens king, man, queen, etc.

      And in that case, why does "Write me a limerick <meeting these criteria>" result in ChatGPT producing limericks, when vanishingly few of the limericks in the source text started that way, and vanishingly few of the commands to write a limerick were immediately followed by one?

  • It works by computing that P(queen | "king - man + woman", corpus) > P(<other words> | "king - man + woman", corpus), i.e. it predicts that the most likely next token after that phrase, based on the entire training corpus and the loss function, is "queen" (sketched below).

    Now, how exactly the LLM is computing this prediction remains poorly understood.
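
    The sketch of that comparison (same caveats as the decoding sketch further up: Hugging Face transformers, GPT-2 as a stand-in, and no guarantee a small model actually ranks "queen" first -- the point is only the shape of the computation):

      # Score candidate next words for a prompt by reading the model's
      # next-token distribution. Illustrative only (GPT-2 via transformers).
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")
      model.eval()

      ids = tokenizer("king - man + woman =", return_tensors="pt").input_ids
      with torch.no_grad():
          probs = model(ids).logits[0, -1].softmax(dim=-1)   # distribution over the vocab

      for word in [" queen", " king", " woman", " banana"]:
          first_token = tokenizer(word).input_ids[0]         # first sub-token of the candidate
          print(word, float(probs[first_token]))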

    • word2vec is pretty much completely understood. There are maybe some bounds that could be pushed lower, but we know exactly how it works.

  • I was curious what GPT-4 would answer with that prompt:

    input: king - man + woman =

    "The equation you're referring to is often used as an example of how word embeddings work in natural language processing (NLP). Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. This allows for capturing semantic similarities between words based on their context.

    The equation "king - man + woman" is a famous example that illustrates how word embeddings can capture analogies. In vector space, the operation is meant to find a word that is similar to "king" but more closely associated with "woman" than "man". The result of this operation is often "queen".

    This result demonstrates the ability of certain word embedding models, like Word2Vec or GloVe, to capture relational similarities between words. It's a reflection of the model's capacity to understand that the relationship between "king" and "man" is similar to the relationship between "queen" and "woman", based on the learned vector representations of these words from large text corpora."

  • word2vec is a very cool result, but suggestive at best of how a modern LLM works. In fact the king - man + woman vector math you’re referring to is pretty much a direct consequence of excluding nonlinearities from the SGD-driven matrix factorization process.

    not everything done via SGD is a ReLU :)

Predicting the next token is how they are trained. But how they operate to achieve that goal is much, much more complicated than the principle of a Markov chain (a toy bigram chain is sketched below for contrast).

In theory, you are the byproduct of 'training' to survive and reproduce. Humans are very good at that task.

But the net capabilities they developed to succeed at it extend far beyond the scope of the task alone.
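
For contrast, the "principle of a Markov chain" mentioned above amounts to something like the following toy bigram model, whose prediction depends only on a count table for the single preceding word:

    # Bigram Markov chain: predict the next word from counts of what followed
    # the previous word in a tiny corpus. No context beyond one word.
    from collections import Counter, defaultdict

    corpus = "the king rules the land and the queen rules the court".split()

    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_word(prev):
        # argmax P(next | prev), estimated from bigram counts alone
        return max(counts[prev], key=counts[prev].get)

    print(next_word("the"))  # whichever word most often followed "the"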