Markov Chains are the Original Language Models

2 years ago (elijahpotter.dev)

What's actually happening in an LLM is many orders of magnitude more complex than a Markov chain. However, I agree that they're an amazing pedagogical tool for the basic principles of how an LLM works, even to a non-technical audience.

Many people try to "explain" LLMs starting with the principles of neural networks. This rarely works well: there are some significant conceptual leaps required.

However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words is something that most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kinds of things they're good/bad at.

Markov chains are a super simple iterated next-word prediction model, and can be explained in 15 minutes. That makes them a great way to explain LLMs to laypeople.

  • HMMs are much more similar to S3, S4, and other deep state-space models than to transformers.

    In fact, HMMs are discrete shallow state-space models.

    I believe S4 and successors might become serious contenders to transformers.

    For a detailed tutorial, see https://srush.github.io/annotated-s4

  • > However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words is something that most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kinds of things they're good/bad at.

    At risk of showing my deficient understanding: that isn't actually true, is it?[1]

    LLMs do much more than simply predict subsequent text, AFAICT. I think back to the earlier LLM results that wowed the world with stuff like "now I have a model where you can ask it 'king - man + woman' ... and it returns 'queen'. Trippy!"

    That is pretty clearly not mere text prediction, at least not identically. Even if nearly all of the computational "hard work" comes from the math of predicting subsequent letters, that whole computation introduces a new primitive: the concept of combining, or doing math on, models to produce other models, and thereby making statements about what words would come next in a variety of scenarios.

    That primitive is not present in the Markov predictor example, at least not without positing how a Markov predictor would be similarly transformed -- which, being very ignorant on the matter, I'm not sure is possible or not. Either way, it leaves out a critical construct that enables ChatGPT to e.g. produce limericks that aren't preceded by the command "Write me a limerick", as in my earlier comment[1].

    [1] Earlier comment on why I think it's dubious to call LLM-based products such as ChatGPT "mere LMs": https://news.ycombinator.com/item?id=35472089

    • > That primitive is not present in the Markov predictor example

      It is. You need to go multi-dimensional; that's where "intelligence" emerges.

    • What you described is the vector-space utility and mathematics that result from encoding words into a dimensional (latent?) space.

      SOTA LLMs use vector-space embeddings in order to get that coordinate-based "intelligence" for free, with a prediction mechanism operating over the embedded context.

      1 reply →

    • My understanding is that LLMs are basically approximations of Markov chains where the state, and hence the probability distribution, spans thousands of words. If you could directly compute and use that matrix, you'd get the same result. But that would be insane.

    • I'm pretty sure that is a property of the word2vec embeddings. I'm not 100% sure if the embeddings/hidden states in LLMs have the same property, but my guess would be they do.

  • Genuine question: what do you mean by many orders of magnitude more complex?

    • I like your question, and I cannot answer it. But I have a benchmark: I can write a Markov chain "language model" in around 10-20 lines of Python, with zero external libraries -- with tokenization and "training" on a text file, and generating novel output. I wrote it in several minutes and didn't bother to save it.

      I'm curious how much time & code it would take to implement this LLM stuff at a similar level of quality and performance.
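
      For the curious, here's a minimal sketch along those lines (not the commenter's lost original; corpus.txt is a hypothetical plain-text file):

        import random
        from collections import defaultdict

        # "Train": record which word follows which in the corpus. Repeats in the
        # follower lists encode the transition frequencies.
        chain = defaultdict(list)
        words = open("corpus.txt").read().split()  # naive whitespace tokenization
        for prev, nxt in zip(words, words[1:]):
            chain[prev].append(nxt)

        # Generate: random-walk the table from a random starting word.
        word = random.choice(words)
        output = [word]
        for _ in range(50):
            if word not in chain:  # dead end (the corpus-final word)
                break
            word = random.choice(chain[word])
            output.append(word)
        print(" ".join(output))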

      2 replies →

    • A typical demonstration Markov chain probably has a context length of around 3. A typical recent LLM probably has more than three billion parameters. That's not precisely apples to apples, but the LLM is certainly vastly more complicated.

      10 replies →

  • They are an especially useful tool right now, one that might become less valuable as we get better at building LLMs. In principle the inner workings of an LLM can be anything from a Markov-chain-like predictor to a beyond-human intelligence. Token prediction is the input/output format we chose, but you could communicate with a human in the same format and the human would show human-level intelligence.

    What makes Markov chains such a great pedagogic tool right now is that they share (approximately) the same interface, being a token predictor, and that current LLMs are much closer to the capabilities of a fantastically good Markov chain than those of an above-human intelligence.

    • > In principle the inner workings of an LLM can be anything from a Markov-chain-like predictor to a beyond-human intelligence.

      I'm afraid I have to disagree. Next-token prediction isn't just the interface we use for LLMs; it is fundamentally what they are, to the very core. The training and loss functions of the foundation models are completely oriented towards next-token accuracy.

      Reasonable people can disagree about emergent behavior and if/how much the model is "planning ahead" in its weights (and what that could even mean), but it is emphatically not the case that the "next token" model is "just an interface". The analogy to human thought isn't accurate at all: we have our own recursive/iterative thought processes, short- and long-term memory, decision-making loops, etc.

      An LLM has no "thought" outside of next-token prediction and no working memory aside from its context window. We don't fully understand all the emergent behavior of a transformer model, but we definitely understand exactly what's happening at the mechanical level: each token is determined, one at a time, by solving an extremely complex but deterministic equation in which the model weights are coefficients, and the output of which is a probability distribution over the next token.

      There's no hidden intelligence or man behind the curtain. Whatever an LLM can do, next-token prediction is how it does it.

      13 replies →

Though it seems like there is some sensitivity around the comparison of LLMs to Markov chains, and certainly an LLM is not generated in the same way, it is pretty accurate that an LLM could be represented by a sufficiently (i.e. very) complex Markov chain. The states of the chain would not be the tokens themselves, as in this example, but the full context window of N input tokens. Each state would fan out to states consisting of the new context C[1:] + A, where C is the current context and A is any of the tokens sampled from the resulting probability distribution, with the transition probabilities drawn from that same distribution according to the temperature settings.

You could even do some very hand-wavy math on how staggeringly complex the resulting Markov chain would get: BERT, for example, has a token vocabulary of 30,000 and a context window of 512 tokens. So the number of possible states would be 30,000^512, or ~1.9 x 10^2292, with each of those fanning out to at most 30,000 other states. So clearly the LLM is a more compact representation of the concept.
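
A quick sanity check of that estimate (same numbers as above):

    import math

    vocab, context = 30_000, 512                # BERT-ish vocabulary and window
    digits = context * math.log10(vocab)        # log10 of 30,000^512
    print(f"~10^{digits:.0f} possible states")  # ~10^2292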

  • It would be more accurate to say that a Markov chain is an example of a method that would perform relatively well at the same training task as an LLM.

    So too might a human trying to predict the next tokens.

    But a human and a Markov chain are not the same underlying process to achieve next-token prediction, and neither is the same underlying process as an LLM.

    • LLMs are Markov chains; a Markov chain is a general concept, not just a text-model technique. You must be thinking of the very simple Markov chain models we had before, where you predicted the next word by looking up sentences with the same preceding words and picking one of the following words at random. That is also a Markov chain, just a much simpler one. You're right that LLMs aren't like that, but they are still Markov chains, with the same kind of inputs and outputs as the old ones.

      7 replies →

It's true that Markov chains are very limited in their capabilities. But one thing I love about them is that they are one of the simplest and most intuitive ways to write code that *learns* from input data.

If you've never written something that *learns*, try it out! Here's a very primitive one I wrote recently that demonstrates the basic idea and explains it along the way.

https://github.com/unoti/markov-basics/blob/main/markov-basi...

This starts with generating US city names, and ends with generating text based on the Dungeons and Dragons DM guide.
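
In the same spirit, here is the city-name idea in miniature: a character-level chain (a sketch, not taken from the linked notebook; cities.txt is a hypothetical file with one name per line):

    import random
    from collections import defaultdict

    # Learn which character follows which, with ^ and $ as start/end markers.
    chain = defaultdict(list)
    for line in open("cities.txt"):
        name = line.strip().lower()
        if name:
            name = "^" + name + "$"
            for a, b in zip(name, name[1:]):
                chain[a].append(b)

    def make_name():
        out, ch = [], "^"
        while True:
            ch = random.choice(chain[ch])
            if ch == "$":
                return "".join(out).title()
            out.append(ch)

    print([make_name() for _ in range(5)])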

  • > that they are one of the simplest and most intuitive ways to write code that learns from input data

    Linear regression is significantly simpler and also learns from input data.

  • Thank you for sharing this! The implementation and examples were super intuitive.

I have no idea if the author is aware, but it's worth noting why a Markov chain has the name and what the difference is from other probabilistic models. The Markov property states that the probability distribution of the next state in some system depends only upon the current state.
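
In symbols (the standard definition, not from the article), for a sequence of states X_1, X_2, ...:

    P(X_{t+1} = x | X_t, X_{t-1}, ..., X_1) = P(X_{t+1} = x | X_t)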

Obviously, language does not have this property and this has been known from the start, but Markov models are extremely computationally tractable, easy to implement, and easy to understand, while doing a good enough job. Introducing recursion and recurrent layers into multilayer neural architectures allowed explicit modeling of variable-length past context for the current state, which is a more accurate representation of how language works, but these were quite expensive to train. Transformer models introduced the attention mechanism to simulate recurrence without explicitly encoding it, reducing the training cost to make it tractable to train equivalently capable models with larger parameter sets on larger training sets, giving us the large language model.

This ability to explicitly capture variable-length context lookback is what makes things like few-shot and one-shot learning possible. The probability distribution is effectively self-modifying at inference time. It's not quite like an animal brain. There is no true plasticity with the strict separation between training and inference, but it gets a lot closer than a Markov model.

You can see in a lot of the comments here about use cases people had for Markov models where they shine. If you're making a bot intended to model one specific topical forum on a single web site, context variance is reduced compared to trying to converse with and understand arbitrary people in arbitrary situations. You're able to capture the relevant context based on how you select your training data. In contrast, current LLMs allow you to train on all text you can find anywhere and the model will perform well in any context.

  • I've seen Markov models mentioned a lot in these contexts, and my generous take has always been that something like stacked Markov models is meant. At each abstraction layer, the state is conditioned only on the previous abstract concept. At the lowest level the states would be the sequence of tokens; higher up, they're concepts like a turn of events in a plot. I don't think this often-proposed idea of hierarchy is sufficient to describe LLMs or human cognition, but it strikes at some essence about parsimony, efficient representation, and local computation.

> This a scratch text area to test the autocomplete in another moment down would be four thousand miles i wish you know please ma am is this new zealand or conversations in another moment down stairs how funny it

^ to me it’s really a stretch that someone could read that and think “markov chains are practically an llm!”

They’re pretty limited, and everyone working on building LLMs knows about them and has probably used them at some point.

They both generate tokens/words based on probabilities, but it’s hard to overstate how much simpler a probability model a Markov chain is compared to an LLM.

You’ve stated that Markov chains have a limited use as a simple chat bot. How do you improve a Markov chain so that it can perform other tasks, like classification, summarization, question answering, complex text parsing, text matching, etc.?

The language modeling done in LLMs has an open pathway to scaling and improving on this statistical model, and it’s incredible to me that this is still opaque to people. I don’t believe that LLMs are a path to intelligence, but I can certainly recognize that they’re going to emerge as a sophisticated and very useful productivity tool.

  • > This is where I am right now. Want to go back to my roots. Some people work on old cars, even though they are less efficient. At the same time though, they are more fun to work on than new cars. I've decided to look into Markov chains.

    Not sure there’s another side to your argument - we all agree :).

    To the author: keep the faith and the excitement if you can; the actual applications of this technology (beyond chatbots) are just getting started.

Sure, LLMs are like 1024-gram Markov chains (or whatever the context limit is). But there are problems: 1) the transition matrix is far too huge to represent, and 2) it treats all pairs of 1024-grams as completely different, even if the first 1023 words are the same.

Function approximation solves both issues, and the Transformer is the best function class we've found so far.

Markov chains are fun. I often use them when teaching a Python fundamentals course. You can create an implementation in around 100 lines of code that explores many features of the language: classes, functions, loops, dictionaries, and lists. Then, you can augment with tests, a command line app, typing, etc.

Markov Chains are cool stuff. Back in 2015 in one of my senior class projects, I built a haiku bot that would take a seed word and make a haiku from a corpus of Tweets I collected from the firehose API.

  • Intriguing. How did you get it to fit the syllable count?

    • I forget now and can't find the code, but if I recall correctly, I had each tokenized word map to its syllables and used that in calculating the next word.

      I wasn't able to get far enough to make each haiku make perfect grammatical sense, but it very often produced interesting results.
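
      A hypothetical reconstruction of that approach (syllables() is a crude vowel-group counter standing in for whatever the original used; chain is assumed to be a word-to-followers table, as in a plain Markov text generator):

        import random

        def syllables(word):
            # crude stand-in: count runs of consecutive vowels
            count, prev = 0, False
            for c in word.lower():
                vowel = c in "aeiouy"
                if vowel and not prev:
                    count += 1
                prev = vowel
            return max(count, 1)

        def haiku_line(chain, word, budget):
            # walk the chain, only accepting words that fit the syllable budget
            out = []
            while budget > 0:
                options = [w for w in chain.get(word, []) if syllables(w) <= budget]
                if not options:
                    return None  # dead end; the caller retries with another seed
                word = random.choice(options)
                out.append(word)
                budget -= syllables(word)
            return " ".join(out)

        # usage: three lines of 5, 7 and 5 syllables
        # lines = [haiku_line(chain, seed, n) for n in (5, 7, 5)]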

This is what I've been saying to people: LLMs aren't much smarter than a Markov chain. A lot of Twitter bots and earlier chat agents were driven by them.

What I do see though is the infrastructure around LLMs making a difference.

  • I've never seen a Markov chain do anything like GPT-4. I'm not sure how you can say with a straight face that they are basically the same.

    • An LLM is a Markov chain with billions of associations and weights. A Markov chain is an LLM with maybe a few dozen associations and weights (so an LM, without the first L).

      The difference is in the data structure and the size of the atoms/n-grams. The data structure Markov chain implementations use is not efficient for billions of parameters, either in storage or in processing. But the idea is the same: give a likely next token given the last n tokens. The value of n is a narrow window for Markov chains and an extremely wide window for LLMs. LLMs are able to maintain massive amounts of state compared to a Markov chain implementation.

      18 replies →

    • LLMs are Markov chains in latent space; it's the latent representation that gives them their power, but ultimately there's not as much difference as one would suspect.

      2 replies →

    • Fair, but maybe it's more of a computer science-cy type of comparison?

      We say systems can perform the same types of computations if they're both Turing complete. Yet, we wouldn't implement everything in every "language" that is Turing complete.

      Perhaps, every LLM could be represented as a Markov chain, and for some it even makes sense (e.g., easier to train, easier to reason about), but in most cases it's a bad idea (e.g., expensive, bad performance).

  • The big difference is the "deep" in "deep learning".

    As the article says, Markov chains can be thought of as linear operations, which are very limited; that's the reason early neural networks weren't taken seriously. Notoriously, you couldn't implement an XOR operation using those early neural nets.

    It all changed when we got to multi-layer perceptrons with backpropagation. The key here is the transfer function, which is non-linear, allowing the network to do infinitely more than you can achieve with simple linear combinations. It is also differentiable, which allows for learning using backpropagation. This is the theoretical foundation behind LLMs, diffusion models, classifiers, etc. Essentially everything we call "AI" today.

    So, yes, by being non-linear, LLMs are deeply smarter than Markov chains (pun intended).
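
    To make the XOR point concrete, here is a minimal sketch (NumPy; not from the article): a two-layer perceptron with sigmoid non-linearities, trained by backpropagation, can learn XOR, which no purely linear model can.

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      rng = np.random.default_rng(0)
      X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
      y = np.array([[0.], [1.], [1.], [0.]])  # XOR truth table

      W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
      W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

      for _ in range(20_000):
          h = sigmoid(X @ W1 + b1)   # hidden layer: this non-linearity is the key
          out = sigmoid(h @ W2 + b2)
          # backpropagate the squared error through both layers
          d_out = (out - y) * out * (1 - out)
          d_h = (d_out @ W2.T) * h * (1 - h)
          W2 -= 0.5 * h.T @ d_out
          b2 -= 0.5 * d_out.sum(0, keepdims=True)
          W1 -= 0.5 * X.T @ d_h
          b1 -= 0.5 * d_h.sum(0, keepdims=True)

      print(out.round(2))  # close to [[0], [1], [1], [0]]

    (Sigmoid nets can occasionally get stuck in a local minimum on XOR; rerunning with a different seed fixes that.)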

  • > This is what I've been saying to people: LLMs aren't much smarter than a Markov chain

    They're based on the same principles, but extend quite beyond that, and are way smarter than Markov Chains.

    So, it's not just the infrastructure that makes the difference (if they were only as effective as Markov chains, then having the same or even more infrastructure wouldn't make any difference).

  • > This is what I've been saying to people: LLMs aren't much smarter than a Markov chain

    And that's why people have been nodding, smiling politely and reaching behind their back for the doorknob.

  • The whole point of Transformers is that they broke the Markov assumption (i.e., that the next token probability is strictly conditioned on a window of N preceding tokens).

    • That's not actually true; they still have a fixed history window. The idea that transformers capture through the attention mechanism is that not all past tokens are created equal, and that the importance of tokens in that history window depends on what they are.

      2 replies →

  • > This is what I've been saying to people: LLMs aren't much smarter than a Markov chain

    Then you're doing some combination of grossly overestimating the latter and/or underestimating the former.

    A Markov Chain is a (small) language model with a context length of 1, never more; that's a definitional requirement of a Markov process.

    • Markov chains are defined in terms of states, which are not necessarily the same as tokens.

      As an example, de Bruijn graphs have been widely used in bioinformatics. There are many definitions, but the key idea is that nodes are strings of length k, and there may be an edge from u to v if the length k-1 suffix of u is the same as the length k-1 prefix of v. It's trivial to turn a de Bruijn graph into a Markov chain with a context of k symbols.

      If you have the Burrows-Wheeler transform of the training data, you can use it to simulate a Markov chain over a de Bruijn graph with any context length. You can also change the context length on the fly. I toyed a little bit with that idea when I was a grad student ~15 years ago, but nothing really came out of it. Building the model was already cheap back then, while getting data was the real bottleneck.

      LLMs are basically lossy approximations of Markov chains. If you have a small accurate approximation of a larger model, the approximate model probably has the ability to generalize. And it turns out that good generalization over textual data looks like intelligent behavior.
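
      A sketch of that first point, with states as k-grams of tokens, de Bruijn style (illustrative only, not the BWT trick described above):

        import random
        from collections import defaultdict

        def build(tokens, k=3):
            # each state is a k-gram; an edge consumes one token, so consecutive
            # states overlap in k-1 positions, as in a de Bruijn graph
            chain = defaultdict(list)
            for i in range(len(tokens) - k):
                chain[tuple(tokens[i:i + k])].append(tokens[i + k])
            return chain

        def generate(chain, state, n=30):
            out = list(state)
            for _ in range(n):
                followers = chain.get(tuple(out[-len(state):]))
                if not followers:
                    break
                out.append(random.choice(followers))
            return " ".join(out)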

      1 reply →

  • Yeah, and Markov chains aren't much smarter than... matrix multiplication. A lot of stuff has been built with it, but those pesky C++ libs like LAPACK — such a bother to use! Imagine a world where good infrastructure had been built around it — matrix multiplication certainly would be a blast!

Markov chains were one of the coolest discoveries of my programming career, and I spent years using them to make forum and social media bots for people, trained on their post histories.

I think that experience is part of why I’ve been generally unimpressed with a lot of LLM hype. Like yeah, it’s cool and definitely more useful than a Markov chain, but for the amount of resources that went into it I’d expect the gap to be quite a bit larger than it really is.

  • I also loved it when I found out about Markov chains, and I had a lot of fun playing with them, but I have a totally different view on the gap with respect to LLMs.

    Markov chains were discovered in 1906. From then until a few years ago, advances in "building a better Markov chain" were modest (e.g. smoothing techniques).

    Enter the last 5 years: LLMs arrive, and now you have an "uber Markov chain" that actually generates perfectly syntactically coherent text. You can even ask it things, and if the question is well-posed and makes sense you will get a true answer at least the majority of the time. They can be a daily tool (for practical purposes, beyond fun), help you solve problems, and write interesting creative stories. A much larger leap in those 5 years than in the previous century!

    I see them as what I always dreamed Markov chains to be, but they couldn't be. The gap is huge.

  • > I think that experience is part of why I’ve been generally unimpressed with a lot of LLM hype.

    There are fundamental scale & architectural differences that make LLMs different from Markov Chains: It's like saying that you're not impressed by indoor plumbing because you have experience carrying your water from a well, and that both do the same thing - transporting water to your home.

    In both cases, this line of logic ignores the improvements made to make such a thing remotely possible, and the difference in relative usefulness that can be gained from LLMs in comparison to Markov Chains.

  • I first learned about Markov chains in the late 90s, when I ran across a Markov chain bot on IRC. The person who wrote it was nice and answered my questions about it. I managed to write my own shitty copy and was pleased enough with myself. Same as you, I thought it was one of the coolest things I came across in programming.

    Also, same as you, that's why I've been very cool on the LLM hype. Not that LLMs aren't feats of their own, but they're not nearly as smart as their hype suggests.

  • The gap between GPT-4 and Markov chains is huge, though. Even the gap between GPT-4 and GPT-3.5 seems obviously huge to me in terms of what they are able to do.

An LLM with temperature 0 will always return the same output for the same input.

Considering this, LLMs are Markov chains; it's just that the output sequence as a whole can be considered one element, and the whole context can be considered "one previous element".

So, the whole block of text on the input is the previous element, and the whole block of text on the output is the next element.

Isn't it?

If we let Markov model mean a probability distribution over a sequence defined as P[ X(t) | X(t-1) ] , then a transformer is specifically not a Markov model. The Markov property means each element is conditional on only the previous element ("context length" = 1).

A discrete table-based probability distribution over possible words, as in the link, is a Markov language model, but not a meaningful contestant among Markov models for language modelling. The latest contestants here are things like RWKV, ResNet, SSMs, and so on, which might be better thought of as HMMs.

An HMM is a Markov chain of hidden ("latent") states, rather than over tokens or words. Variations have been used for speech recognition and language modelling for 20-30 years.
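
A toy illustration of that structure (made-up numbers): the Markov chain steps over hidden states, and each step emits a visible token from whichever state the chain is in.

    import random

    transitions = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
    emissions   = {"S1": {"the": 0.5, "a": 0.5}, "S2": {"cat": 0.6, "dog": 0.4}}

    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    state, out = "S1", []
    for _ in range(10):
        out.append(sample(emissions[state]))  # emit from the current hidden state
        state = sample(transitions[state])    # Markov step over hidden states
    print(" ".join(out))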

Claude Shannon showed you could use n-grams to represent language. Then Chomsky invented automata / formal languages theory to show why it couldn’t work.

IIRC transformers are at the very bottom of the Chomsky hierarchy, and yet they are clearly able to master English grammar.

What gives?

Someone wrote a Markov chaining bot back on LambdaMOO in the early 90s, that would pretend to participate in conversation in the room, and I remember being blown away by it. Especially in the context of the limited computation power on that platform and the machine it was hosted on.

I had an exam at Uni where one of the problems was related to Markov Chains; brings back memories.

Also, for a lab we used Markov Chains to create 'music'. The results were terrible, but the exercise was good.

I agree! I'm an undergrad and I used Markov chains to explain how I'm able to predict what my colleagues are going to say! (BTW I'm a chess player, so that's how I know what it is :)

It could be me not understanding something but:

> Since there is a 25% probability Alice is at the grocery store, we multiply that with the probability of her transitioning to the Planetarium: 25% ∗ 75%

Shouldn't this be 25% * 70%?

In 2016 I created a Twitter bot that replicated a political activist via a Markov chain model. It was impressive how many people liked those posts, including a mid-level politician.

  • Can you link to that Twitter bot? I'm just curious about the quality of the posts it generated.

LLMs are a reincarnation of Markov Chains, which, by the way, were surprisingly underutilized in the past.

Back around 2002 we built an IRC bot that logged everything in a channel, built up a frequency table of which word follows which, and then would randomly spew out some text using a random walk of the frequency tables, with some logic for when to stop (I think the probability of stopping increased with sentence length).

Sometimes it came up with very funny stuff, but mostly nonsense.