Comment by saithound
11 days ago
A Markov chain [1] is a discrete-time stochastic process in which the value of each variable depends only on the value of the immediately preceding variable, and not on any earlier variables.
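In symbols, with X_t the state at step t, the Markov property reads

    P(X_{t+1} = s | X_t, X_{t-1}, ..., X_0) = P(X_{t+1} = s | X_t)

i.e. conditioning on the entire history tells you nothing beyond what the current state already does.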
LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the set of possible context windows, and the distribution of the new context window depends only on the previously sampled context.
A Markov chain is a Markov chain no matter how you implement it in a computer: as a lookup table, an ordinary C function, a one-layer neural net, or a transformer.
LLMs and Markov text generators are technically both Markov chains, so some of the same math applies to both. But that's where the similarities end: e.g. the state space of an LLM is the set of possible context windows, whereas the state space of a Markov text generator is usually the set of N-tuples of words.
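To make the contrast concrete, here is a minimal Python sketch of one transition step for each. It is only an illustration: `model` is a hypothetical callable returning a next-token distribution for the current window, not any particular library's API, and the 4096-token window length is just an assumed bound. Both steps have the same shape, state in, next state out; only the state spaces differ.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# --- Classic Markov text generator: the state is an N-tuple of words. ---

def build_ngram_table(words: List[str], n: int = 2) -> Dict[Tuple[str, ...], List[str]]:
    """Lookup table mapping each N-tuple of words to the words observed after it."""
    table: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for i in range(len(words) - n):
        table[tuple(words[i:i + n])].append(words[i + n])
    return table

def ngram_step(state: Tuple[str, ...], table: Dict[Tuple[str, ...], List[str]]) -> Tuple[str, ...]:
    """One transition: the next state depends only on the current N-tuple."""
    next_word = random.choice(table[state])
    return state[1:] + (next_word,)

# --- Autoregressive LLM decoding: the state is the whole bounded context window. ---

def llm_step(state: Tuple[int, ...], model, max_len: int = 4096) -> Tuple[int, ...]:
    """One transition: the next window is a (random) function of the current window alone."""
    probs = model(state)                       # hypothetical: {token_id: probability}
    tokens, weights = zip(*probs.items())
    next_token = random.choices(tokens, weights=weights, k=1)[0]
    return (state + (next_token,))[-max_len:]  # append the sample, keep the window bounded
```

Neither function peeks at anything outside its `state` argument, which is all the Markov property asks for. What differs is how big and how structured the state spaces are, and how the transition distribution is computed.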
And since the question here is how tiny LLMs differ from Markov text generators, those differences certainly matter.
[1] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain
>LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the set of possible context windows, and the distribution of the new context window depends only on the previously sampled context.
When 'the previously sampled context' can be arbitrarily long and complex, and of arbitrary modality, that's not a Markov chain. That's just being funny with words. By that logic, humans are also a Markov chain.
No, context windows are not arbitrarily long and complex: they are bounded, so the set of possible context windows is a large finite set. The mathematical theory of Markov chains does not depend at all on what the elements of the state space look like. The same math applies.
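For a rough sense of scale (using purely illustrative figures, a 50,000-token vocabulary and a 4,096-token window, not any specific model's parameters), the number of possible context windows of length up to n is the geometric sum 1 + V + V^2 + ... + V^n:

```python
# Rough count of possible context windows: finite, but astronomically large.
# V (vocabulary size) and n (max window length) are illustrative assumptions only.
V, n = 50_000, 4_096

# Number of token sequences of length 0..n: (V**(n+1) - 1) / (V - 1).
num_states = (V ** (n + 1) - 1) // (V - 1)

print(f"the state space has {len(str(num_states))} decimal digits")  # ~19,000+ digits here
```

Finite, so the Markov-chain formalism applies verbatim; just nothing like the handful of observed N-tuples an N-gram table actually enumerates.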
You argue LLMs are Markov chains because the set of possible context windows is a 'large finite set.' But the set of possible physical configurations of the human brain is also a large finite set: we have a finite number of neurons and synaptic states, and we do not possess infinite memory or infinite context.
Therefore, by your strict mathematical definition, a human is also a discrete-time Markov chain.
And that is exactly my point: if your definition is broad enough to group N-gram lookup tables, LLMs, and human beings into the same category, that category is useless for this discussion. We are trying to distinguish between simple statistical generators and neural models. Pointing out that they both satisfy the Markov property is technically true, but reductive to the point of absurdity.