Comment by tanepiper

2 years ago

This is what I've been saying to people: LLMs aren't much smarter than a Markov Chain; a lot of Twitter bots and earlier chat agents were driven by them.

What I do see though is the infrastructure around LLMs making a difference.

I've never seen a Markov chain do anything like GPT4. I'm not sure how you can say with a straight face they are basically the same.

  • An LLM is a Markov chain with billions of associations and weights. A Markov chain is an LLM of maybe a few dozen associations and weights (so an LM, without the first L).

    The difference is in the data structure and the size of the atoms/n-grams. The data structure Markov chain implementations use is not efficient for billions of parameters, either in storage or in processing. But the idea is the same: predict a likely next token given the last n tokens. The value of n is a narrow window for Markov chains and an extremely wide window for LLMs. LLMs are able to maintain massive amounts of state compared to a Markov chain implementation.
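    To make this concrete, a minimal sketch (plain Python; word-level tokens, the window size n, and the toy corpus are arbitrary choices for illustration) of the lookup-table approach a typical Markov chain implementation uses; note that "training" is nothing more than counting which token follows each window of n tokens:

    ```python
    import random
    from collections import Counter, defaultdict

    def train_markov(tokens, n=2):
        """For every window of n tokens, count which token follows it."""
        counts = defaultdict(Counter)
        for i in range(len(tokens) - n):
            counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
        return counts

    def generate(counts, seed, length=20):
        """Sample a likely next token given only the last n tokens."""
        out, n = list(seed), len(seed)
        for _ in range(length):
            nxt = counts.get(tuple(out[-n:]))
            if not nxt:  # unseen window: a plain Markov chain has nothing to say
                break
            toks, weights = zip(*nxt.items())
            out.append(random.choices(toks, weights=weights)[0])
        return out

    corpus = "the cat sat on the mat and the cat sat on the rug".split()
    model = train_markov(corpus, n=2)
    print(" ".join(generate(model, seed=("the", "cat"))))
    ```

    Scaling this to a window of hundreds of tokens over a web-scale vocabulary is exactly where the lookup table stops being a workable data structure.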

    • “A human brain is just like a dog’s brain, only with more neural pathways.” True, perhaps, but largely pointless: at some point neural complexity results in a difference of kind, not of degree.

      I’d argue the same is true of LLMs vs simpler models like Markov chains.

      3 replies →

    • The idea behind a complex system can be very simple, but at scale it yields very different results.

      Comparing a Markov chain with an LLM is kind of like comparing a single-cell organism to a human being because both are based on cells.

      3 replies →

    • It's not just window size. It's the difference between syntax and semantics.

      A Markov model, by definition, works only with literal token histories. It can't participate meaningfully in a conversation unless the user happens to employ token sequences that the model has seen before (ideally multiple times). An LLM can explain why it's not just a Markov model, but the converse isn't true.

      Now, if you were to add high-dimensional latent-space embedding to a Markov model, that would make the comparison more meaningful, and would allow tractable computation with model sizes that were completely impossible to deal with before. But then it wouldn't be a Markov model anymore. Or, rather, it would still be a Markov model, but one that's based on relationships between tokens rather than just their positions in a linear list.

      Another analogy might be to say that a Markov model can implement lossless compression only, while a latent-space model can implement lossy compression. There's a school of thought that says that lossy compression doesn't just require intelligence, it is intelligence, and LLMs can be seen as an example of that equivalence. Not saying I agree with that school, or that you should, but as someone else pointed out, comparing Markov chains with LLMs is at best like comparing goldfish brains with human brains.

      4 replies →

  • Fair, but maybe it's more of a computer-science-y type of comparison?

    We say systems can perform the same types of computations if they're both Turing complete. Yet, we wouldn't implement everything in every "language" that is Turing complete.

    Perhaps every LLM could be represented as a Markov chain, and in some cases that even makes sense (e.g., easier to train, easier to reason about), but in most cases it's a bad idea (e.g., expensive, bad performance).

  • No one has spent $100M on training Markov chains.

    • The trick with Markov chains is that you don't need to.

      Markov chains are dead simple. There's not really a "training" step so much as simply reading data and collecting statistics.

      They're so simple that you can probably build one nearly as fast as you can read the training data.

    • I think it would take some spectacularly bad engineering to be that wasteful. It would need to be so inefficient that even getting ChatGPT to write the code wouldn't be bad enough.

The big difference is the "deep" in "deep learning".

As the article says, Markov chains can be thought of as linear operations, which are very limited; that limitation is the reason early neural networks weren't taken seriously. Notoriously, you couldn't implement an XOR operation using those early neural nets.

It all changed when we got to multi-layer perceptrons with backpropagation. The key here is the transfer function, which is non-linear, allowing the network to do infinitely more than you can achieve with simple linear combinations. It is also differentiable, which allows for learning via backpropagation. This is the theoretical foundation behind LLMs, diffusion models, classifiers, and so on: essentially everything we call "AI" today.
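As a concrete illustration, a sketch (numpy; the layer sizes, learning rate, and iteration count are arbitrary choices) of a two-layer perceptron learning XOR via backpropagation, something no purely linear model can fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with a non-linear transfer function is what makes XOR learnable.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

lr = 0.5
for step in range(20000):
    # Forward pass through the non-linear, differentiable transfer function.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradients flow through that same differentiable function.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3).ravel())  # should approach [0, 1, 1, 0]
```

Remove the hidden layer (or make the activation linear) and no amount of training will fit XOR, which is exactly the limitation that held back the early single-layer networks.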

So, yes, by being non-linear, LLMs are deeply smarter than Markov chains (pun intended).

> This is what I've been saying to people: LLMs aren't much smarter than a Markov Chain

They're based on the same principles, but go well beyond them, and are way smarter than Markov chains.

So, it's not just the infrastructure that makes the difference (if LLMs were only as effective as Markov chains, then having the same or even more infrastructure wouldn't make any difference).

> This is what I've been saying to people: LLMs aren't much smarter than a Markov Chain

And that's why people have been nodding, smiling politely and reaching behind their back for the doorknob.

The whole point of Transformers is that they broke the Markov assumption (i.e., that the next token probability is strictly conditioned on a window of N preceding tokens).

  • That's not actually true; they still have a fixed history window. The idea that transformers capture through the attention mechanism is that not all past tokens are created equal, and that the importance of tokens in that history window depends on what they are.
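    To illustrate that weighting, a toy sketch (numpy; a single attention head with random projections, positional encodings omitted, all sizes arbitrary): each token in the window gets a weight that depends on its content relative to the current query, not merely on how long ago it appeared:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d = 16                       # embedding dimension (arbitrary)
    x = rng.normal(size=(5, d))  # 5 token embeddings in the history window

    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    q = x[-1] @ Wq               # query derived from the latest token
    K = x @ Wk                   # keys for every token in the window
    V = x @ Wv                   # values for every token in the window

    scores = K @ q / np.sqrt(d)                       # content-dependent relevance
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: unequal weights
    context = weights @ V                             # weighted mix of the window

    print(weights.round(3))  # not uniform: importance depends on what the tokens are
    ```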

    • They also scale differently: Markov chains scale exponentially with the size of the window, while transformers scale quadratically. So transformers are in fact exponentially more efficient, though without a bound on resources their capabilities are a strict subset of those of Markov chains.
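      Rough back-of-the-envelope numbers (the vocabulary size, window length, and model width below are picked arbitrarily for illustration) showing why the lookup-table route explodes while attention stays merely expensive:

      ```python
      V, n, d = 50_000, 1_000, 4_096     # vocab size, context window, model width

      markov_states = V ** n             # one table row per possible n-token window
      attention_cost = n * n * d         # pairwise token interactions per layer (roughly)

      print(len(str(markov_states)))     # ~4,700 digits: far more rows than atoms in the universe
      print(f"{attention_cost:,}")       # ~4 billion multiply-adds: large but feasible
      ```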

> This is what I've been saying to people: LLMs aren't much smarter than a Markov Chain

Then you're doing some combination of grossly overestimating the latter and/or underestimating the former.

A Markov chain is a (small) language model with a context length of 1, never more; that's a definitional requirement of a Markov process.

  • Markov chains are defined in terms of states, which are not necessarily the same as tokens.

    As an example, de Bruijn graphs have been widely used in bioinformatics. There are many definitions, but the key idea is that nodes are strings of length k, and there may be an edge from u to v if the length k-1 suffix of u is the same as the length k-1 prefix of v. It's trivial to turn a de Bruijn graph into a Markov chain with a context of k symbols.
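    For what it's worth, a tiny character-level sketch of that construction (plain Python; k and the toy string are arbitrary): nodes are k-mers, each observed overlap contributes a suffix/prefix edge, and a weighted walk over those edges behaves as a Markov chain with a context of k symbols:

    ```python
    import random
    from collections import Counter, defaultdict

    def de_bruijn_chain(text, k=3):
        """Weighted de Bruijn-style graph: edge u -> v whenever u[1:] == v[:-1] in the data."""
        edges = defaultdict(Counter)
        for i in range(len(text) - k):
            u = text[i:i + k]           # current k-mer (node)
            v = text[i + 1:i + k + 1]   # next k-mer, overlapping in k-1 symbols
            edges[u][v] += 1
        return edges

    def walk(edges, start, steps=30):
        """A random walk over the graph is a Markov chain with context k."""
        node, out = start, start
        for _ in range(steps):
            nbrs = edges.get(node)
            if not nbrs:
                break
            node = random.choices(list(nbrs), weights=list(nbrs.values()))[0]
            out += node[-1]             # each step emits one new symbol
        return out

    g = de_bruijn_chain("bananabandana", k=3)
    print(walk(g, "ban"))
    ```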

    If you have the Burrows-Wheeler transform of the training data, you can use it to simulate a Markov chain over a de Bruijn graph with any context length. You can also change the context length on the fly. I toyed a little bit with that idea when I was a grad student ~15 years ago, but nothing really came out of it. Building the model was already cheap back then, while getting data was the real bottleneck.

    LLMs are basically lossy approximations of Markov chains. If you have a small accurate approximation of a larger model, the approximate model probably has the ability to generalize. And it turns out that good generalization over textual data looks like intelligent behavior.

    • > Markov chains are defined in terms of states, which are not necessarily the same as tokens.

      While true, I don't see how that helps. The example isn't particularly illuminating: yes, you can make a Markov chain that encompasses all possible states in principle, but in practice you can't do that with a language model, because you can't list all possible states (they mostly don't exist in the training set, and even if they did there's no room to store them). Instead you have to make a system which is, for lack of a non-anthropomorphic verb, "trying to guess" the state; most of the cool stuff in the LLM is the system doing the guessing[0].

      > If you have the Burrows-Wheeler transform of the training data, you can use it to simulate a Markov chain over a de Bruijn graph with any context length. You can also change the context length on the fly. I toyed a little bit with that idea when I was a grad student ~15 years ago, but nothing really came out of it. Building the model was already cheap back then, while getting data was the real bottleneck.

      To me, the Burrows-Wheeler transform looks like a way to generate a compact tokenisation that doesn't make presumptions about data structure. While I think that's a good idea (and much better than the naive English-language tokenisation of "anything that's not a letter or apostrophe", which IMO probably harmed early NLP), I don't think it's a good argument for drawing similarities between Transformer models (or even RNN models) and Markov models.

      > LLMs are basically lossy approximations of Markov chains. If you have a small accurate approximation of a larger model, the approximate model probably has the ability to generalize. And it turns out that good generalization over textual data looks like intelligent behavior.

      I wonder if you're using "approximation" to mean something different than I would understand the term?

      You can map a Turing machine to a sufficiently large Markov chain whose probabilities are always 1 or 0, but that has the same problem: Turing machines and LLMs aren't lossy approximations of Markov chains; they're at least generalisations that only look the same when you construct an extreme form.

      Hopefully we both agree that it would be very weird to argue that "the language centres of human brains are basically just approximations of Markov chains"? Even though we can probably sufficiently well describe each neuron within a brain by a Markov chain?

      [0] This reminds me of my issue with John Searle's Chinese room argument: Searle denies the system truly understands Chinese because the human reading the books and following the instructions doesn't know Chinese, I would argue that the intelligence and understanding is encoded in the books, and that the entity lacking understanding is the substrate upon which the instructions are processed.

      To argue "the system as a whole can't be said to understand because the human doing the data processing doesn't understand" seems to me to map to saying "humans can't be said to understand because the brain cells doing the data processing don't understand", or even "… atoms conveying electrochemical signals across synapses don't understand".

      But I'm clearly digressing now, so I'm glad I put this in a footnote…

Yeah, and Markov chains aren't much smarter than... matrix multiplication. A lot of stuff has been built with it, but those pesky low-level libs like LAPACK are such a bother to use! Imagine a world where good infrastructure had been built around it: matrix multiplication certainly would be a blast!