Comment by famouswaffles
4 days ago
Yes, technically you can frame an LLM as a Markov chain by defining the "state" as the entire sequence of previous tokens. But this is a vacuous observation under that definition, literally any deterministic or stochastic process becomes a Markov chain if you make the state space flexible enough. A chess game is a "Markov chain" if the state includes the full board position and move history. The weather is a "Markov chain" if the state includes all relevant atmospheric variables.
The problem is that this definition strips away what makes Markov models useful and interesting as a modeling framework. A “Markov text model” is a low-order Markov model (e.g., n-grams) with a fixed, tractable state and transitions based only on the last k tokens. LLMs aren’t that: they model using un-fixed long-range context (up to the window). For Markov chains, k is non-negotiable. It's a constant, not a variable. Once you make it a variable, near any process can be described as markovian, and the word is useless.
Sure many things can be modelled as Markov chains, which is why they're useful. But it's a mathematical model so there's no bound on how big the state is allowed to be. The only requirement is that all you need is the current state to determine the probabilities of the next state, which is exactly how LLMs work. They don't remember anything beyond the last thing they generated. They just have big context windows.
The etymology of the "markov property" is that the current state does not depend on history.
And in classes, the very first trick you learn to skirt around history is to add Boolean variables to your "memory state". Your systems now model, "did it rain The previous N days?" The issue obviously being that this is exponential if you're not careful. Maybe you can get clever by just making your state a "sliding window history", then it's linear in the number of days you remember. Maybe mix the both. Maybe add even more information .Tradeoffs, tradeoffs.
I don't think LLMs embody the markov property at all, even if you can make everything eventually follow the markov property by just "considering every single possible state". Of which there are (size of token set)^(length) states at minimum because of the KV cache.
The KV cache doesn't affect it because it's just an optimization. LLMs are stateless and don't take any other input than a fixed block of text. They don't have memory, which is the requirement for a Markov chain.
2 replies →
>Sure many things can be modelled as Markov chains
Again, no they can't, unless you break the definition. K is not a variable. It's as simple as that. The state cannot be flexible.
1. The markov text model uses k tokens, not k tokens sometimes, n tokens other times and whatever you want it to be the rest of the time.
2. A markov model is explcitly described as 'assuming that future states depend only on the current state, not on the events that occurred before it'. Defining your 'state' such that every event imaginable can be captured inside it is a 'clever' workaround, but is ultimately describing something that is decidedly not a markov model.
It's not n sometimes, k tokens some other times. LLMs have fixed context windows, you just sometimes have less text so it's not full. They're pure functions from a fixed size block of text to a probability distribution of the next character, same as the classic lookup table n gram Markov chain model.
10 replies →