← Back to context

Comment by empiko

4 days ago

LLMs are indeed Markov chains. The breakthrough is that we are able to efficiently compute well performing probabilities for many states using ML.

LLMs are not Markov Chains unless you contort the meaning of a Markov Model State so much you could even include the human brain.

  • Not sure why that's contorting, a markov model is anything where you know the probability of going from state A to state B. The state can be anything. When it's text generation the state is previous text to text with an extra character, which is true for both LLMs and oldschool n-gram markov models.

    • A GPT model would be modelled as an n-gram Markov model where n is the size of the context window. This is slightly useful for getting some crude bounds on the behaviour of GPT models in general, but is not a very efficient way to store a GPT model.

      2 replies →

    • Yes, technically you can frame an LLM as a Markov chain by defining the "state" as the entire sequence of previous tokens. But this is a vacuous observation under that definition, literally any deterministic or stochastic process becomes a Markov chain if you make the state space flexible enough. A chess game is a "Markov chain" if the state includes the full board position and move history. The weather is a "Markov chain" if the state includes all relevant atmospheric variables.

      The problem is that this definition strips away what makes Markov models useful and interesting as a modeling framework. A “Markov text model” is a low-order Markov model (e.g., n-grams) with a fixed, tractable state and transitions based only on the last k tokens. LLMs aren’t that: they model using un-fixed long-range context (up to the window). For Markov chains, k is non-negotiable. It's a constant, not a variable. Once you make it a variable, near any process can be described as markovian, and the word is useless.

      17 replies →

  • Well LLMs aren't human brains, unless you contort the definition of matrix algebra so much you could even include them.

    • QM and GR can be written as matrix algebra, atoms and electrons are QM, chemistry is atoms and electrons, biology is chemistry, brains are biology.

      An LLM could be implemented with a Markov chain, but the naïve matrix is ((vocab size)^(context length))^2, which is far too big to fit in this universe.

      Like, the Bekenstein bound means writing the transition matrix for an LLM with just 4k context (and 50k vocabulary) at just one bit resolution, the first row (out of a bit more than 10^18795 rows) ends up with a black hole >10^9800 times larger than the observable universe.

      1 reply →

Markov models with more than 3 words as "context window" produce very unoriginal text in my experience (corpus used had almost 200k sentences, almost 3 million words), matching the OP's experience. These are by no means large corpuses, but I know it isn't going away with a larger corpus.[1] The Markov chain will wander into "valleys" of reproducing paragraphs of its corpus one for one because it will stumble upon 4-word sequences that it has only seen once. This is because 4 words form a token, not a context window. Markov chains don't have what LLMs have.

If you use a syllable-level token in Markov models the model can't form real words much beyond the second syllable, and you have no way of making it make more sense other than increasing the token size, which exponentially decreases originality. This is the simplest way I can explain it, though I had to address why scaling doesn't work.

[1] There are 4^400000 possible 4-word sequences in English (barring grammar) meaning only a corpus with 8 times that amount of words and with no repetition could offer two ways to chain each possible 4 word sequence.

Yeah, there's only two differences between using Markov chains to predict words and LLMs:

* LLMs don't use Markov chains, * LLMs don't predict words.

They are definitely not Markov Chains they may, however, be Markov Models. There's a difference between MC and MM.

  • What do you mean? The states are fully observable (current array of tokens), and using an LLM we calculate the probabilities of moving between them. What is not MC about this?

    • I suggest getting familiar with or brushing up on the differences between a Markov Chain and a Markov Model. The former is a substantial restriction of the latter. The classic by Kemeny and Snell is a good readable reference.

      MC have constant and finite context length, their state is the most recent k tuple of emitted alphabets and transition probabilities are invariant (to time and tokens emitted)

      4 replies →