Comment by lsy

2 years ago

Though it seems like there is some sensitivity around comparing LLMs to Markov chains, and certainly an LLM is not produced in the same way, it is fairly accurate to say that an LLM could be represented by a sufficiently (i.e., very) complex Markov chain. The states of the chain would not be individual tokens, as in this example, but the entire context window of N input tokens. Each state would fan out to states consisting of the new context N[1:] + A, where A is any token sampled from the model's output probability distribution, and the transition probabilities are drawn from that same distribution according to the temperature setting.
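As a toy illustration (mine, not from the comment), here is a minimal Python sketch of that construction. A stand-in `next_token_dist` function plays the role of the LLM; the Markov state is the full context tuple, and each transition shifts the window by one sampled token:

```python
import random

VOCAB = ["a", "b", "c"]  # toy vocabulary standing in for the LLM's 30,000 tokens
N = 4                    # toy context window standing in for e.g. 512 tokens

def next_token_dist(state):
    """Stand-in for the LLM: maps a context (tuple of N tokens)
    to a probability distribution over the vocabulary."""
    # Hypothetical toy rule: heavily favor the token after the last one, cyclically.
    i = VOCAB.index(state[-1])
    probs = [0.1] * len(VOCAB)
    probs[(i + 1) % len(VOCAB)] += 1.0 - sum(probs)
    return probs

def transition(state):
    """One step of the induced Markov chain: sample a token A from the
    distribution at this state, then move to the state state[1:] + (A,)."""
    a = random.choices(VOCAB, weights=next_token_dist(state))[0]
    return state[1:] + (a,)

state = ("a", "b", "c", "a")  # initial context
for _ in range(5):
    state = transition(state)
print(state)  # a tuple of N tokens: the current "state" of the chain
```

The point of the sketch is only that the transition depends on nothing but the current state, which is exactly the Markov property.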

You could even do some very hand-wavy math on how staggeringly complex the resulting Markov chain would get: BERT, for example, has a token vocabulary of 30,000 and a context window of 512 tokens. So the number of possible states would be 30,000^512, or ~1.9 x 10^2292, with each state fanning out to at most 30,000 others. So clearly the LLM is a far more compact representation of the same concept.
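That back-of-the-envelope count is easy to check with Python's arbitrary-precision integers (my quick verification, not part of the original comment):

```python
VOCAB_SIZE = 30_000   # BERT's token vocabulary
CONTEXT = 512         # BERT's context window

states = VOCAB_SIZE ** CONTEXT

# 2293 decimal digits, i.e. states ≈ 1.9 × 10^2292
print(len(str(states)))   # → 2293
print(str(states)[:2])    # → 19  (the leading digits of the ~1.9 estimate)
```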

It would be more accurate to say that a Markov chain is an example of a method that would perform relatively well at the same training task as an LLM.

So too might a human trying to predict the next tokens.

But a human and a Markov chain do not use the same underlying process to achieve next-token prediction, and neither uses the same underlying process as an LLM.

  • LLMs are Markov chains; a Markov chain is a general concept, not just a text-modeling technique. You must be thinking of the very simple Markov chain models we had before, where you predicted the next word by looking up sentences with the same preceding words and picking one of those next words at random. That is also a Markov chain, just like an LLM, only a much simpler one. You're right that LLMs aren't like that, but they are still Markov chains, with the same kinds of inputs and outputs as the old ones.
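For contrast, here is a minimal sketch (my own illustration) of that older style of Markov chain text model, conditioning on only the single preceding word:

```python
import random
from collections import defaultdict

def build_chain(corpus):
    """Count, for each word, which words follow it in the corpus."""
    chain = defaultdict(list)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        chain[prev].append(nxt)
    return chain

def generate(chain, start, length):
    """Walk the chain: from each word, pick a random observed successor."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ran on the floor"
chain = build_chain(corpus)
print(generate(chain, "the", 6))  # e.g. "the cat sat on the mat"
```

Structurally it is the same object as the LLM-induced chain above, just with a one-word state instead of a 512-token one, and transition probabilities read off from raw counts instead of computed by a neural network.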