Comment by lukev

2 years ago

What's actually happening in an LLM is many orders of magnitude more complex than a Markov chain. However, I agree that they're an amazing pedagogical tool for teaching the basic principles of how an LLM works, even to a non-technical audience.

Many people try to "explain" LLMs starting with the principles of neural networks. This rarely works well: there are some significant conceptual leaps required.

However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words is something most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kinds of things they're good/bad at.

Markov chains are a super simple iterated next-word prediction model and can be explained in 15 minutes. They're a great way to explain LLMs to laypeople.

HMMs are much more similar to S3, S4, and other deep state-space models than to transformers.

In fact, HMMs are discrete shallow state-space models.

I believe S4 and successors might become serious contenders to transformers.

For a detailed tutorial, see https://srush.github.io/annotated-s4
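
To make the structural parallel concrete, here is a rough sketch of a single recurrence step of each model (toy dimensions and made-up numbers, just to show the shape of the computation):

  import numpy as np

  # HMM forward step: alpha_t(j) = b_j(o_t) * sum_i alpha_{t-1}(i) * a_ij
  A = np.array([[0.9, 0.1], [0.2, 0.8]])   # state-transition probabilities
  B = np.array([[0.7, 0.3], [0.4, 0.6]])   # emission probabilities
  alpha = np.array([0.5, 0.5])             # belief over hidden states
  obs = 1                                  # observed symbol at this step
  alpha = (alpha @ A) * B[:, obs]          # propagate, then reweight by the observation

  # Discretized linear SSM step (the S4 building block): h_t = A h_{t-1} + B u_t, y_t = C h_t
  A_bar = np.array([[0.95, 0.0], [0.1, 0.9]])
  B_bar = np.array([1.0, 0.5])
  C = np.array([0.3, 0.7])
  h = np.zeros(2)
  u = 1.0                                  # input at this step
  h = A_bar @ h + B_bar * u
  y = C @ h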

>However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words is something most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kinds of things they're good/bad at.

At risk of showing my deficient understanding: that isn't actually true, is it?[1]

LLMs do much more than simply predict subsequent text, AFAICT. I think back to the earlier LLM results that wowed the world with stuff like "now I have a model where you can ask it 'king - man + woman' ... and it returns 'queen'. Trippy!"

That is pretty clearly not mere text prediction, at least not identically. Even if nearly all of the computational "hard work" comes from the math of predicting subsequent letters, that whole computation introduces a new primitive: the concept of combining, or doing math on, models to produce other models, and thereby making a statement about what words would come next in a variety of scenarios.

That primitive is not present in the Markov predictor example, at least not without positing how a Markov predictor could be similarly transformed -- which, being very ignorant on the matter, I'm not sure is possible -- but either way, it leaves out a critical construct that enables ChatGPT to e.g. find limericks that aren't preceded by the command "Write me a limerick", as in my earlier comment[1].

[1] Earlier comment on why I think it's dubious to call LLM-based products such as ChatGPT "mere LMs": https://news.ycombinator.com/item?id=35472089
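
To be concrete about the arithmetic I mean, here's a toy sketch with made-up 3-d vectors (real embeddings have hundreds of dimensions; the words and numbers here are invented purely for illustration):

  import numpy as np

  # made-up 3-d embeddings, purely illustrative
  emb = {
      "king":  np.array([0.9, 0.8, 0.1]),
      "queen": np.array([0.9, 0.1, 0.8]),
      "man":   np.array([0.1, 0.9, 0.1]),
      "woman": np.array([0.1, 0.1, 0.9]),
      "apple": np.array([0.5, 0.5, 0.5]),
  }

  target = emb["king"] - emb["man"] + emb["woman"]

  def cosine(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  # nearest word to the result, excluding the query words
  best = max((w for w in emb if w not in {"king", "man", "woman"}),
             key=lambda w: cosine(emb[w], target))
  print(best)  # "queen" with these toy vectors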

  • Predicting subsequent text is pretty much exactly what they do. Lots of very cool engineering that's a real feat, but at its core it's argmax(P(next token | preceding tokens, corpus)):

    https://github.com/facebookresearch/llama/blob/main/llama/ge...

    The engineering feats are up there with anything, but it’s a next token predictor.
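
    In code terms, the core loop is roughly this (a sketch from memory, not the actual llama code; a `model` that returns one logit per vocabulary entry is assumed):

      def generate(model, tokens, n_new):
          # greedy decoding: repeatedly append the highest-scoring next token
          for _ in range(n_new):
              logits = model(tokens)                    # one score per vocab entry
              next_token = max(range(len(logits)), key=lambda i: logits[i])
              tokens.append(next_token)
          return tokens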

    • Predicting the next token is how they are trained. But how they operate to achieve that goal is much, much more complicated than the principle of a Markov chain.

      In theory, you are the byproduct of 'training' to survive and reproduce. Humans are very good at that task.

      But the capabilities humans developed in order to succeed at it extend far beyond the scope of the task alone.

  • > That primitive is not present in the Markov predictor example

    It is. You need to go multi-dimensional; that's where "intelligence" emerges.

  • What you described is the vector-space math that results from encoding words into a dimensional (latent?) space.

    SOTA LLMs use vector-space embeddings, so they get that coordinate-based "intelligence" for free, via a vector-space prediction mechanism over the context.

    • Thanks, but can you clarify which confusion in my comment you're untangling there?

  • My understanding is that LLMs are basically approximations of Markov chains whose state is thousands of words long, with a probability distribution conditioned on that entire context. If you could directly compute and use that transition matrix, you'd get the same result. But that matrix would be insanely large.
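
    A quick back-of-envelope on why (the vocabulary and context sizes are just illustrative round numbers):

      import math

      vocab = 50_000      # illustrative vocabulary size
      context = 2_000     # illustrative context length in tokens
      # an explicit transition table needs one row per possible context:
      digits = context * math.log10(vocab)
      print(round(digits))  # ~9398, i.e. on the order of 10^9398 rows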

  • I'm pretty sure that is a property of the word2vec embeddings. I'm not 100% sure whether the embeddings/hidden states in LLMs have the same property, but my guess would be they do.

Genuine question: what do you mean by many orders of magnitude more complex?

  • I like your question, and I cannot answer it. But I have a benchmark: I can write a Markov chain "language model" in around 10-20 lines of Python, with zero external libraries -- with tokenization and "training" on a text file, and generating novel output. I wrote it in several minutes and didn't bother to save it.

    I'm curious how much time & code it would take to implement this LLM stuff at a similar level of quality and performance.
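
    For reference, it looked roughly like this (rewritten from memory, untested; order-2, word-level, standard library only):

      import random, sys
      from collections import defaultdict

      words = open(sys.argv[1]).read().split()
      model = defaultdict(list)
      for a, b, c in zip(words, words[1:], words[2:]):
          model[(a, b)].append(c)              # "training": record every observed continuation

      state = random.choice(list(model))       # random starting bigram
      out = list(state)
      for _ in range(100):
          nxt = random.choice(model.get(state, ["."]))   # sample the next word
          out.append(nxt)
          state = (state[1], nxt)
      print(" ".join(out))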

    • Generally, LLM architectures are pretty low-code, I thought (I haven't written one myself).

      Then all of the complexity comes with the training/weight data.

  • A typical demonstration Markov chain probably has a length of around 3. A typical recent LLM probably has more than three billion parameters. That's not precisely apples to apples, but the LLM is certainly vastly more complicated.

    • Number of parameters is not the difference. A Markov chain can easily be a multi-dimensional matrix with millions of entries. The significant difference is that a length-3 Markov chain can only ever find connections between 3 adjacent symbols (words, usually). LLMs seem to be able to find and connect abstract concepts at very long and variable distances in the input.

      Nevertheless, I agree with the premise of the posting. I used Markov chains recently to teach someone what a statistical model of language is, followed by explaining the perceptron to them, and then (hand waving a bit) explaining how stacking many large, deep layers scales everything up massively.
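
      The perceptron part of that walkthrough fits in a few lines too (a toy sketch; here it just learns logical AND):

        def train_perceptron(data, epochs=20, lr=0.1):
            # data: list of (inputs, label) pairs with label in {0, 1}
            w = [0.0] * len(data[0][0])
            b = 0.0
            for _ in range(epochs):
                for x, y in data:
                    pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
                    err = y - pred                     # 0 if correct, +/-1 if wrong
                    w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                    b += lr * err
            return w, b

        # learns a linear decision boundary for AND
        w, b = train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)])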

They are an especially useful tool right now, but they might become less valuable as we get better at building LLMs. In principle, the inner workings of an LLM can be anything from a Markov-chain-like predictor to a beyond-human intelligence. Token prediction is the input/output format we chose, but you could communicate with a human in the same format and the human would still show human-level intelligence.

What makes Markov chains such a great pedagogical tool right now is that they share (approximately) the same interface, being token predictors, and that current LLMs are much closer to the capabilities of a fantastically good Markov chain than to those of an above-human intelligence.

  • > In principle the inner working of an LLM can be anything from a Markov-chain-like predictor to a beyond-human intelligence.

    I'm afraid I have to disagree. Next-token prediction isn't just the interface we use for LLMs; it is fundamentally what they are, to the very core. The training and loss function of the foundation models are completely oriented towards next-token accuracy.

    Reasonable people can disagree about emergent behavior and whether/how much the model is "planning ahead" in its weights (and what that could even mean), but it is emphatically not the case that the "next token" model is "just an interface". The analogy to human thought isn't accurate at all: we have our own recursive/iterative thought processes, short- and long-term memory, decision-making loops, etc.

    An LLM has no "thought" outside of next-token prediction and no working memory aside from its context window. We don't fully understand all the emergent behavior of a transformer model, but we definitely understand exactly what's happening at the mechanical level: each token is determined, one at a time, by solving an extremely complex but deterministic equation in which the model weights are coefficients, and the output of which is a probability distribution over the next token.

    There's no hidden intelligence or man behind the curtain. Whatever an LLM can do, next-token prediction is how it does it.
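
    Schematically, a single decoding step is just this (with the whole transformer forward pass abstracted into an assumed function `f`):

      import math

      def softmax(logits):
          m = max(logits)
          exps = [math.exp(x - m) for x in logits]
          s = sum(exps)
          return [e / s for e in exps]

      def step(f, weights, context):
          logits = f(weights, context)   # one big deterministic computation
          return softmax(logits)         # probability distribution over the next token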

    • >The training and loss function of the foundation models are completely oriented towards next-token accuracy.

      This doesn't mean anything.

      The loss function and training only concern themselves with the result of the prediction. The in-between, the computation, is something training does not care about except as a means to an end.

      It's not Input A > Output B. It's Input A > Computation > Output B.

      That computation could quite literally be anything. And no, we do not automatically know what it might represent, or whether it represents anything we would even understand.

      If you meticulously train a Transformer on predicting the token that is the result of an addition, you might hope the computation it learns is some algorithm for addition, but you wouldn't actually know until you probed the model and succeeded.
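
      For example, the kind of training setup I mean (a throwaway sketch; real probing/interpretability work would be far more careful):

        import random

        def addition_corpus(n, max_val=999):
            # synthetic next-token-prediction data: "a+b=c" strings
            lines = []
            for _ in range(n):
                a, b = random.randint(0, max_val), random.randint(0, max_val)
                lines.append(f"{a}+{b}={a + b}")
            return "\n".join(lines)

        print(addition_corpus(3))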
