Yeah, if it were true, then Markov chains on steroids (LLMs) would be great at problem solving, but they can't even reason. And they get worse if you add random facts about cats: https://news.ycombinator.com/item?id=44724238
That doesn’t follow at all, but I guess don’t pass up any opportunity to bash LLMs.
LLMs aren't Markov chains unless they have a context window of 1.
>In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event
The tricky thing is that you get to define the state. So if the "state" is the current word _and_ the previous 10, it is still "memoryless". So an LLM's context window is the state. It doesn't matter that _we_ call part of the state "history"; the Markov chain doesn't care (they are all just different features). There's a toy sketch of this framing below.
Edit: I could be missing important nuance that other people are pointing out in this thread!
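To make that concrete, here's a toy sketch of the framing (all names illustrative, not any real library): a classic order-k word model, which satisfies the Markov property once you define the state as the last k words rather than just the current one.

    import random
    from collections import defaultdict

    # Order-k Markov chain over words: the "state" is the tuple of the
    # last k words, so the next word depends only on that state, and the
    # chain is still "memoryless" in the Markov sense.
    def train(words, k=2):
        table = defaultdict(list)
        for i in range(len(words) - k):
            table[tuple(words[i:i + k])].append(words[i + k])
        return table

    def generate(table, state, n=15, k=2):
        out = list(state)
        for _ in range(n):
            candidates = table.get(tuple(out[-k:]))
            if not candidates:
                break  # unseen state: stop
            out.append(random.choice(candidates))
        return " ".join(out)

    corpus = "the cat sat on the mat and the cat ran off the mat".split()
    print(generate(train(corpus, k=2), ("the", "cat")))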
LLMs are Markov Chains on steroids?
By the way, does anyone know which model, or type of model, was used to win gold at the IMO?
> LLMs are Markov Chains on steroids?
It's not an unreasonable view, at least for decoder-only LLMs (which is what most popular LLMs are). While it may seem they violate the Markov property since they clearly do make use of their history, in practice that entire history is summarized in an embedding passed into the decoder. I.e., just like a Markov chain, their entire history is compressed into a single point, which leaves them conditionally independent of their past given their present state.
It's worth noting that this claim is NOT generally applicable to LLMs, since both encoder-decoder and encoder-only LLMs do violate the Markov property and therefore cannot properly be considered Markov chains in a meaningful way.
But running inference on a decoder-only model is, at a high enough level of abstraction, conceptually the same as running a Markov chain (on steroids).
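Concretely, that abstraction looks something like this (a toy sketch; `next_token_dist` is a stand-in for the model's forward pass, not any real API, and `WINDOW` stands in for the context length):

    import random

    WINDOW = 8  # stand-in for the model's finite context length

    def next_token_dist(state):
        # Stand-in for the model: any function of the *current state only*.
        # A real decoder-only LLM computes this with a forward pass over
        # the window; nothing outside the window can influence it.
        rng = random.Random(hash(state))
        return {t: rng.random() for t in ("a", "b", "c", "<eos>")}

    def step(state):
        # One Markov transition: sample a token from a distribution that
        # depends only on the current state, then slide the window.
        dist = next_token_dist(state)
        tok = random.choices(list(dist), weights=list(dist.values()))[0]
        return (state + (tok,))[-WINDOW:]

    state = ("<bos>",)
    for _ in range(10):
        state = step(state)
    print(state)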
A Markov process is any process where if you have perfect information on the current state, you cannot gain more information about the next state by looking at any previous state.
Physics models of closed systems moving under classical mechanics are deterministic, continuous Markov processes. Random walks on a graph are non-deterministic, discrete Markov processes.
You can generalize further: if a process has state X, and the prior N states contribute to predicting the next state, you can define a new process whose state is an N-vector of Xs. The graph connecting those vector states reduces the evolution of the system to a random walk on a graph, and thus to a Markov process.
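As a sketch of that construction (illustrative code; the example is deterministic on purpose, since deterministic processes are Markov too):

    # Lifting an order-N process to order-1: wrap a predictor that needs
    # the last N values into a chain whose single state is an N-tuple.
    def lift(step_fn, n):
        def order1_step(state):
            assert len(state) == n       # state: tuple of the last n values
            return state[1:] + (step_fn(state),)  # slide the window
        return order1_step

    # Fibonacci "needs" the previous 2 values, but the lifted chain's
    # next state depends only on its current (2-tuple) state: order 1.
    fib = lift(lambda h: h[0] + h[1], 2)
    state = (0, 1)
    for _ in range(8):
        state = fib(state)
    print(state)  # (21, 34)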
Thus any system where the best possible model of its evolution requires you to examine at most finitely many consecutive states immediately preceding the current state is a Markov process.
For example, an LLM that will process a finite context window of tokens and then emit a weighted random token is most definitely a Markov process.
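And the conditional-independence claim is easy to check on a toy stand-in (not a real model): two different histories that end in the same window induce exactly the same next-token distribution.

    WINDOW = 2

    def next_token_dist(window):
        # Stand-in model: any deterministic function of the window alone.
        score = sum(len(t) for t in window)
        return {"a": score, "b": score + 1}

    hist1 = ("x", "y", "cat", "sat")
    hist2 = ("a", "completely", "different", "prefix", "cat", "sat")

    w1, w2 = hist1[-WINDOW:], hist2[-WINDOW:]
    assert w1 == w2 == ("cat", "sat")
    # Same window => identical distribution; the rest of the history is
    # irrelevant given the current state. That's the Markov property.
    print(next_token_dist(w1) == next_token_dist(w2))  # True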
> LLMs are Markov Chains on steroids?
Might be a reference to this[1] blog post, which was posted here[2] a year ago.
There has also been some academic work linking the two, like this[3] paper.
[3]: https://arxiv.org/abs/2410.02724