Comment by acjohnson55
2 years ago
Isn't it though, if you consider the entire context to be part of the state? It seems like his argument is based on an assumption of the Markov model only using the current word as its state.
That's not the only source making this point:
https://safjan.com/understanding-differences-gpt-transformer...
If we're broadening the scope of a Markov chain to consider the entire system as the current state that's being used to determine the next step of operation, then isn't literally every computer program a Markov chain under that definition?
You can't just fold the memory of previous states that the current state depends on into the "current state" to fit the definition of a Markov chain; if you do that, you've broadened the definition to the point of meaninglessness.
No, every computer program is not a Markov chain, because computer programs aren't random processes.
Look up the concepts of "higher-order Markov chains" or "Markov chains with memory".
Obviously, there are reasons that people don't directly implement LLMs as very high-order Markov chains. But it doesn't mean that the equivalence isn't mathematically or intuitively useful.
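To make "higher-order Markov chain" concrete, here's a minimal sketch (toy corpus and names invented for illustration): the state is the tuple of the last k words, and the next word depends only on that tuple.

```python
import random
from collections import defaultdict, Counter

def build_chain(tokens, order=2):
    """Count next-token frequencies keyed on the last `order` tokens.

    The 'state' is the tuple of the previous `order` tokens, which is
    exactly what makes this an order-k (higher-order) Markov chain.
    """
    chain = defaultdict(Counter)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chain[state][tokens[i + order]] += 1
    return chain

def sample_next(chain, state):
    """Sample the next token given only the current state tuple."""
    counts = chain[tuple(state)]
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

corpus = "the cat sat on the mat the cat ran on the mat".split()
chain = build_chain(corpus, order=2)
print(sample_next(chain, ("on", "the")))  # always "mat" in this corpus
```

The practical problem with scaling this up is the one people usually cite: the state space grows as vocabulary^order, so a transformer-sized lookback is hopeless to tabulate explicitly.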
I'm just a non-expert, but as far as I can tell, if there's an upper bound on the lookback in the context window, I don't see how that transformer isn't strictly less powerful than a Markov chain of that order. It's just a computationally tractable way to closely approximate one.
If the potential lookback distance is unbounded, then I think you could say that it's different from a finite-order Markov chain.
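The bounded-lookback point can be shown in a toy form: if a model's output is a pure function of its last WINDOW tokens, then the tuple of those tokens is a valid Markov state, and any two histories agreeing on the window are indistinguishable. The predictor below is a hypothetical stand-in, not a real model.

```python
# Stand-in for any bounded-context predictor: the output depends
# only on the last WINDOW tokens, so the window tuple is a Markov
# state and the process satisfies the Markov property over windows.
WINDOW = 4

def next_token_score(context):
    """Hypothetical scoring function; deterministic in the window alone."""
    window = tuple(context[-WINDOW:])  # everything older is invisible
    return hash(window) % 100

# Two histories that agree on the last WINDOW tokens are the same state:
a = ["alpha", "x", "y", "z", "w"]
b = ["beta", "gamma", "x", "y", "z", "w"]
assert next_token_score(a) == next_token_score(b)
```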
The Markov chain by definition satisfies the Markov property (https://en.m.wikipedia.org/wiki/Markov_property).
If you want a more in-depth sense of why it's not, look at the following link, which describes the transformer architecture in detail and walks through the move from a Markov process to one that no longer satisfies the Markov property, particularly around this paragraph:
> The first thing that becomes apparent is that, when trying to predict the word that comes after ran, we no longer look at just one line, but rather a whole set of them. We've moved out of the Markov realm now. Each row no longer represents the state of the sequence at a particular point. Instead, each row represents one of many features that may describe the sequence at a particular point. The combination of the most recent word with each of the words that came before makes for a collection of applicable rows, maybe a large collection. Because of this change in meaning, each value in the matrix no longer represents a probability, but rather a vote. Votes will be summed and compared to determine next word predictions.
- https://e2eml.school/transformers.html
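The quoted paragraph's "votes, not probabilities" idea can be sketched in a few lines. The features and numbers below are invented for illustration: each active feature (the most recent word paired with some earlier word) contributes a row of votes, and the rows are summed to rank next-word candidates.

```python
# Toy version of the quoted idea: rows are per-feature votes over the
# vocabulary, not a single Markov transition row of probabilities.
vocab = ["ran", "home", "fast", "away"]
votes = {
    ("dog", "ran"): [0.0, 0.3, 0.1, 0.6],  # hypothetical vote rows
    ("big", "ran"): [0.0, 0.2, 0.5, 0.1],
}

active = [("dog", "ran"), ("big", "ran")]  # features describing the sequence now
total = [sum(row) for row in zip(*(votes[f] for f in active))]
prediction = vocab[total.index(max(total))]
print(prediction)  # "away": it wins the summed vote
```

Note that the prediction depends on a whole collection of rows at once, which is exactly why a single "current state" row no longer captures what's going on.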