Comment by acjohnson55
2 years ago
No, a computer program is not in general a Markov chain, because computer programs aren't random processes.
Look up the concepts of "higher-order Markov chains" or "Markov chains with memory".
Obviously, there are reasons that people don't directly implement LLMs as very high-order Markov chains. But it doesn't mean that the equivalence isn't mathematically or intuitively useful.
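To make "memory" concrete, here's a minimal sketch of an order-k Markov chain over tokens (my own toy code, not from the linked article): the state is the tuple of the last k tokens, and transitions are estimated from counts.

```python
# Minimal sketch of an order-k Markov chain "with memory":
# the state is the tuple of the last k tokens.
import random
from collections import defaultdict, Counter

def train_order_k(tokens, k=2):
    counts = defaultdict(Counter)
    for i in range(len(tokens) - k):
        state = tuple(tokens[i:i + k])      # the k-token "memory"
        counts[state][tokens[i + k]] += 1
    return counts

def sample_next(counts, state):
    options = counts[tuple(state)]
    if not options:
        return None
    tokens, weights = zip(*options.items())
    return random.choices(tokens, weights=weights)[0]

corpus = "the cat sat on the mat and the cat ran off".split()
model = train_order_k(corpus, k=2)
print(sample_next(model, ("the", "cat")))   # 'sat' or 'ran'
```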
I'm just a non-expert, but as far as I can tell, if there's an upper bound on the lookback in the context window, I don't see how that transformer isn't strictly less powerful than a Markov chain of that order. It's just a computationally tractable way to closely approximate it.
If the potential lookback distance is unbounded, then I think you could say that it's different from a finite-order Markov chain.
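Here's that point as a sketch (mine; `predict_next_distribution` is a hypothetical stand-in for any bounded-context model, including a transformer): once the lookback is capped at K tokens, the K-token window itself serves as the Markov state, so each step is an ordinary Markov transition.

```python
# Sketch: any next-token model whose lookback is capped at K tokens induces
# an order-K Markov chain. Treat the K-token window as the state; one step
# samples a token from the window-conditioned distribution and slides the
# window forward. `predict_next_distribution` is hypothetical.
import random

K = 4  # bounded lookback (context window)

def predict_next_distribution(window):
    """Stand-in for a bounded-context predictor; returns {token: prob}
    using only `window`, the last K tokens."""
    raise NotImplementedError  # hypothetical model

def markov_step(window):
    probs = predict_next_distribution(window)
    tokens, weights = zip(*probs.items())
    nxt = random.choices(tokens, weights=weights)[0]
    return nxt, (window + (nxt,))[-K:]   # new state = new last-K window
```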
A Markov chain, by definition, satisfies the Markov property (https://en.m.wikipedia.org/wiki/Markov_property).
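Spelled out (these are the standard definitions, not anything specific to the article): the Markov property says the next value depends only on the current one; the order-k version conditions on the last k values, and it collapses back to a first-order chain if you take the whole k-window as the state.

```latex
% First-order Markov property:
P(X_{n+1} = x \mid X_n, X_{n-1}, \dots, X_1) = P(X_{n+1} = x \mid X_n)

% Order-k ("with memory"):
P(X_{n+1} = x \mid X_n, \dots, X_1) = P(X_{n+1} = x \mid X_n, \dots, X_{n-k+1})

% State augmentation: S_n := (X_n, \dots, X_{n-k+1}) makes (S_n) first-order again.
```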
If you want a more in-depth sense of why it's not, look at the following link, which describes the transformer architecture in detail and specifically walks through moving from a Markov process to one that no longer satisfies the Markov property, around the following paragraph:
> The first thing that becomes apparent is that, when trying to predict the word that comes after ran, we no longer look at just one line, but rather a whole set of them. We've moved out of the Markov realm now. Each row no longer represents the state of the sequence at a particular point. Instead, each row represents one of many features that may describe the sequence at a particular point. The combination of the most recent word with each of the words that came before makes for a collection of applicable rows, maybe a large collection. Because of this change in meaning, each value in the matrix no longer represents a probability, but rather a vote. Votes will be summed and compared to determine next word predictions.
- https://e2eml.school/transformers.html
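If it helps, here's a toy rendering of what that paragraph describes (my own sketch with made-up vote counts, not the article's implementation): each pairing of an earlier word with the most recent word selects a row of votes, and the rows get summed.

```python
# Toy version of the quoted idea: (earlier word, most recent word) pairs act
# as features that cast votes for the next word; votes are summed, not
# treated as probabilities. The vote table below is made up.
from collections import Counter

votes = {
    ("check", "ran"):   {"the": 3, "out": 1},
    ("battery", "ran"): {"down": 4, "out": 2},
    ("program", "ran"): {"please": 2, "fine": 1},
}

def predict_next(context):
    most_recent = context[-1]
    tally = Counter()
    for earlier in context[:-1]:
        tally.update(votes.get((earlier, most_recent), {}))
    return tally.most_common(1)[0][0] if tally else None

print(predict_next(["check", "the", "battery", "ran"]))   # -> 'down'
```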
The author made the choice to depart from literally implementing a Markov model. That's a practical choice, but not strictly necessary if you're not worried about practicality. I think you're getting hung up on engineering decisions made to actually implement a language model, versus the theory.
In the same section, the author says:
> The difference between this second-order-with-skips and a full umpteenth-order model is that we discard most of the word order information and combinations of preceding words. What remains is still pretty powerful.
This implies that sure, you could hypothetically do an umpteenth-order model, but dropping down to something that approximates it "is still pretty powerful".
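The "hypothetically" is carrying a lot of weight there, and it's just counting: a full order-N table needs an entry for every possible N-token context, which is |V|^N for a vocabulary V (the numbers below are illustrative, not from the article).

```latex
% Rows in a full order-N lookup table over vocabulary V:
\underbrace{|V| \times |V| \times \dots \times |V|}_{N \text{ positions}} = |V|^{N}
% e.g. |V| = 50{,}000,\ N = 10: \quad 50{,}000^{10} \approx 10^{47} \text{ rows}
```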
Thanks for the link, by the way! I'm definitely going to read through the whole thing. I'm desperately trying to understand this technology.