Comment by kromem
2 years ago
No, the self-attention in the GPT transformer means it isn't a Markov chain.
A blog post if you want to read more:
https://medium.com/@andrew_johnson_4/are-transformers-markov...
Isn't it though, if you consider the entire context to be part of the state? It seems like his argument rests on the assumption that the Markov model only uses the current word as its state.
That post isn't the only one stating this:
https://safjan.com/understanding-differences-gpt-transformer...
If we're broadening the scope of a Markov chain so that the entire system counts as the current state used to determine the next step, then isn't literally every computer program a Markov chain under that definition?
You can't just fold the memory of previous states that the current state depends on into the "current state" to fit the definition of a Markov chain; at that point you've broadened the scope until the definition becomes meaningless.
No, every computer program is not a Markov chain, because computer programs aren't random processes.
Look up the concepts of "higher-order Markov chains" or "Markov chains with memory".
Obviously, there are reasons that people don't directly implement LLMs as very high-order Markov chains. But it doesn't mean that the equivalence isn't mathematically or intuitively useful.
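To make that concrete, here's a toy order-k chain where the "state" is the last k tokens (a rough Python sketch, not how GPT is actually built):

    # Toy order-k Markov chain over tokens. Illustrative only: GPT is not
    # implemented this way, but a bounded context window is, in principle,
    # just conditioning on the last k tokens.
    import random
    from collections import defaultdict, Counter

    def train_order_k(tokens, k):
        # Count next-token frequencies for each length-k state.
        counts = defaultdict(Counter)
        for i in range(len(tokens) - k):
            state = tuple(tokens[i:i + k])   # the "state" is the last k tokens
            counts[state][tokens[i + k]] += 1
        return counts

    def sample_next(counts, state):
        # Sample the next token from the distribution conditioned on the state.
        dist = counts[tuple(state)]
        candidates, weights = zip(*dist.items())
        return random.choices(candidates, weights=weights)[0]

    corpus = "the cat sat on the mat the cat ran on the mat".split()
    model = train_order_k(corpus, k=2)
    print(sample_next(model, ["on", "the"]))  # -> "mat"

The obvious catch is that the state space grows like vocab^k, which is why nobody builds a GPT-sized context window as an explicit transition table.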
I'm just a non-expert, but as far as I can tell, if there's an upper bound on the lookback in the context window, I don't see how that transformer isn't strictly less powerful than a Markov chain of that order. It's just a computationally tractable way of closely approximating one.
If the potential lookback distance is unbounded, then I think you could say that it's different from a finite-order Markov chain.
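Written out, the claim (as I understand it) is just that with a context window of k tokens,

    P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | x_{t-k+1}, ..., x_t)

which is exactly the order-k Markov property over the token sequence; the transformer parameterizes that conditional distribution compactly instead of storing it as a table.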