Comment by ben_w

2 years ago

> This is what I've been saying to people, LLMs aren't much smarter than a Markov Chain

Then you're grossly overestimating the latter, underestimating the former, or some combination of the two.

A Markov chain is a (small) language model with a context length of 1, never more; that's a definitional requirement of a Markov process.
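
To make "context length of 1" concrete, here's a minimal token-level sketch (the corpus, tokenisation and names are just illustrative): the next token is sampled from counts conditioned on the single previous token and nothing else.

```python
import random
from collections import defaultdict, Counter

# Toy corpus; tokens are just whitespace-separated words here.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# First-order Markov chain: next-token counts conditioned on ONE previous token.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def sample(start, length=10):
    token, out = start, [start]
    for _ in range(length):
        counts = transitions.get(token)
        if not counts:  # dead end: this token never appears with a successor
            break
        token = random.choices(list(counts), weights=list(counts.values()))[0]
        out.append(token)
    return " ".join(out)

print(sample("the"))
```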

Markov chains are defined in terms of states, which are not necessarily the same as tokens.

As an example, de Bruijn graphs have been widely used in bioinformatics. There are many definitions, but the key idea is that nodes are strings of length k, and there may be an edge from u to v if the length-(k-1) suffix of u is the same as the length-(k-1) prefix of v. It's trivial to turn a de Bruijn graph into a Markov chain with a context of k symbols.
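
A rough sketch of that conversion, assuming character-level symbols; the toy string and k=3 are arbitrary. Walking the weighted edges is a Markov chain whose state is the current k-mer, i.e. a context of k symbols.

```python
import random
from collections import defaultdict, Counter

text = "banana bandana banana band"
k = 3

# de Bruijn-style graph: nodes are k-mers; an edge u -> v exists when the
# (k-1)-suffix of u equals the (k-1)-prefix of v, weighted by how often that
# transition occurs in the text.
edges = defaultdict(Counter)
for i in range(len(text) - k):
    u = text[i:i + k]
    v = text[i + 1:i + 1 + k]
    edges[u][v] += 1

def walk(start, steps=20):
    node, out = start, start
    for _ in range(steps):
        nxt = edges.get(node)
        if not nxt:
            break
        node = random.choices(list(nxt), weights=list(nxt.values()))[0]
        out += node[-1]  # consecutive nodes overlap by k-1 symbols, so emit one new symbol
    return out

print(walk(text[:k]))
```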

If you have the Burrows-Wheeler transform of the training data, you can use it to simulate a Markov chain over a de Bruijn graph with any context length. You can also change the context length on the fly. I toyed a little bit with that idea when I was a grad student ~15 years ago, but nothing really came out of it. Building the model was already cheap back then, while getting data was the real bottleneck.
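
Not the BWT machinery itself, but a naive sketch of the same idea, with plain substring counting standing in for the BWT/FM-index backward search (which is what would make these lookups cheap); the corpus, k, and the back-off behaviour are all just illustrative. The context length can change from step to step.

```python
import random
from collections import Counter

text = "to be or not to be, that is the question"

def next_symbol_distribution(context):
    """Count which symbols follow `context` anywhere in the training text.
    With a BWT/FM-index this lookup would be fast; here it's a linear scan."""
    counts = Counter()
    start = text.find(context)
    while start != -1:
        end = start + len(context)
        if end < len(text):
            counts[text[end]] += 1
        start = text.find(context, start + 1)
    return counts

def generate(seed, steps=40, k=4):
    out = seed
    for _ in range(steps):
        context = out[-k:]             # context of up to k symbols
        counts = next_symbol_distribution(context)
        while not counts and context:  # back off to a shorter context on the fly
            context = context[1:]
            counts = next_symbol_distribution(context)
        if not counts:
            break
        out += random.choices(list(counts), weights=list(counts.values()))[0]
    return out

print(generate("to b", k=4))
```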

LLMs are basically lossy approximations of Markov chains. If you have a small accurate approximation of a larger model, the approximate model probably has the ability to generalize. And it turns out that good generalization over textual data looks like intelligent behavior.

  • > Markov chains are defined in terms of states, which are not necessarily the same as tokens.

    While true, I don't see how that helps. The example isn't particularly illuminating: yes, in principle you can make a Markov chain that encompasses all possible states, but in practice you can't do that with a language model, because you can't list all possible states (most don't exist in the training set, and even if they did, there's no room to store them). Instead you have to make a system which is, for lack of a non-anthropomorphic verb, "trying to guess" the state; most of the cool stuff in the LLM is the system doing the guessing[0].

    > If you have the Burrows-Wheeler transform of the training data, you can use it to simulate a Markov chain over a de Bruijn graph with any context length. You can also change the context length on the fly. I toyed a little bit with that idea when I was a grad student ~15 years ago, but nothing really came out of it. Building the model was already cheap back then, while getting data was the real bottleneck.

    To me, the Burrows-Wheeler transform looks like a way to generate a compact tokenisation that doesn't make presumptions about the structure of the data. While I think that's a good idea (and much better than the naive English-language tokenisation of "anything that's not a letter or apostrophe", which IMO probably harmed early NLP), I don't think it's a good argument for drawing similarities between Transformer models (or even RNN models) and Markov models.

    > LLMs are basically lossy approximations of Markov chains. If you have a small accurate approximation of a larger model, the approximate model probably has the ability to generalize. And it turns out that good generalization over textual data looks like intelligent behavior.

    I wonder if you're using "approximation" to mean something different from what I'd understand by the term?

    You can map a Turing machine to a sufficiently large Markov chain whose transition probabilities are always 1 or 0 (see the toy construction at the end of this comment), but that has the same problem: Turing machines and LLMs aren't lossy approximations of Markov chains; they're at least generalisations that only look the same when you construct an extreme form.

    Hopefully we both agree that it would be very weird to argue that "the language centres of human brains are basically just approximations of Markov chains"? Even though we could probably describe each individual neuron within a brain sufficiently well with a Markov chain?

    [0] This reminds me of my issue with John Searle's Chinese room argument: Searle denies that the system truly understands Chinese because the human reading the books and following the instructions doesn't know Chinese; I would argue that the intelligence and understanding are encoded in the books, and that the entity lacking understanding is the substrate on which the instructions are processed.

    To argue "the system as a whole can't be said to understand because the human doing the data processing doesn't understand" seems to me to map to saying "humans can't be said to understand because the brain cells doing the data processing don't understand", or even "… atoms conveying electrochemical signals across synapses don't understand".

    But I'm clearly digressing now, so I'm glad I put this in a footnote…
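
    Since I pointed at it above, here's the toy construction: each full Turing-machine configuration (state, tape contents, head position) becomes one Markov-chain state, and every state has exactly one successor with probability 1 (everything else is implicitly 0). The two-state machine and its rules are entirely made up for illustration.

    ```python
    # A made-up machine: overwrite 0s with 1s moving right, halt on blank.
    rules = {
        ("scan", "0"): ("scan", "1", +1),
        ("scan", "1"): ("scan", "1", +1),
        ("scan", "_"): ("halt", "_", 0),
    }

    def step(config):
        """Deterministic transition: exactly one next configuration, probability 1."""
        state, tape, head = config
        if state == "halt":
            return config
        new_state, write, move = rules[(state, tape[head])]
        tape = tape[:head] + write + tape[head + 1:]
        return (new_state, tape, head + move)

    config = ("scan", "0010_", 0)
    while config[0] != "halt":
        print(config)
        config = step(config)
    print(config)
    ```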