Comment by nonameiguess

2 years ago

I have no idea if the author is aware, but it's worth noting why a Markov chain has the name it does and how it differs from other probabilistic models. The Markov property states that the probability distribution of the next state in a system depends only on the current state, not on the history of states that came before it.
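
To make that concrete, here is a minimal first-order Markov text generator (my own illustrative sketch, not anything from the article): the transition table is keyed only by the current token, so generation never looks further back than one word.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each token to the list of tokens that followed it in the corpus."""
    chain = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=20):
    """Random walk: each step depends only on the current token."""
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
print(generate(build_chain(corpus), "the"))
```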

Obviously, language does not have this property, and this has been known from the start, but Markov models are extremely computationally tractable, easy to implement, and easy to understand, while doing a good enough job. Introducing recurrence (recurrent layers) into multilayer neural architectures allowed explicit modeling of variable-length past context for the current state, which is a more accurate representation of how language works, but these were quite expensive to train. Transformer models replaced explicit recurrence with the attention mechanism, which captures long-range context without passing state step by step, reducing the training cost enough to make it tractable to train equivalently capable models with larger parameter sets on larger training sets, giving us the large language model.
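
For intuition, here is a minimal sketch of scaled dot-product attention in NumPy (purely illustrative, not any particular model's implementation): every position gets a weighted view of every other position, so the representation at the current step can draw on arbitrarily distant context rather than a single previous state.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each position takes a weighted
    average over all positions in a single pass, with weights given
    by query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # context-mixed representations

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))             # 5 token embeddings, 8 dims each
print(attention(x, x, x).shape)         # (5, 8)
```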

This ability to explicitly capture variable-length context lookback is what makes things like few-shot and one-shot learning possible. The probability distribution is effectively self-modifying at inference time. It's not quite like an animal brain. There is no true plasticity, given the strict separation between training and inference, but it gets a lot closer than a Markov model.
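
A rough way to see the difference (my illustration, with a made-up toy task): in a few-shot prompt, the demonstrations sit in the same context window as the query, so they condition the next-token distribution directly; a first-order Markov model would condition on nothing but the final token.

```python
# Hypothetical few-shot prompt; the task and labels are invented for illustration.
demonstrations = [
    ("great movie, loved it", "positive"),
    ("total waste of time", "negative"),
]
query = "surprisingly good soundtrack"

prompt = "\n".join(f"Review: {text}\nSentiment: {label}"
                   for text, label in demonstrations)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)

# An LLM conditions its next-token distribution on this entire string,
# so the two demonstrations steer the prediction. A first-order Markov
# chain would see only the final token and could not use them at all.
```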

A lot of the comments here describe exactly the use cases where Markov models shine. If you're making a bot intended to model one specific topical forum on a single web site, context variance is greatly reduced compared to trying to converse with and understand arbitrary people in arbitrary situations, and you can capture the relevant context through how you select your training data. In contrast, current LLMs let you train on all the text you can find anywhere, and the model will perform well in any context.

I've seen Markov models mentioned a lot in contexts like this, and my generous take has always been that something like stacked Markov models is meant: at each layer of abstraction, the state is conditioned only on the previous state at that layer. At the lowest level the states would be the sequence of tokens; higher up, they would be concepts like turns of events in a plot. I don't think this often-proposed idea of hierarchy is sufficient to describe LLMs or human cognition, but it strikes at something essential about parsimony, efficient representation, and local computation.
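
To put a concrete shape on that "stacked" idea (my own sketch; the construction isn't specified above): a top-level first-order chain over abstract states such as plot beats, each of which drives its own token-level first-order chain.

```python
import random
from collections import defaultdict

def build_chain(sequence):
    """First-order transition table: next state depends only on the current one."""
    chain = defaultdict(list)
    for cur, nxt in zip(sequence, sequence[1:]):
        chain[cur].append(nxt)
    return chain

def walk(chain, start, steps):
    out = [start]
    for _ in range(steps):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return out

# Top layer: a Markov chain over abstract plot beats.
beat_chain = build_chain(["setup", "conflict", "climax", "resolution"])
# Bottom layer: a separate token-level chain per beat.
token_chains = {
    "setup":      build_chain("once upon a time in a quiet town".split()),
    "conflict":   build_chain("a dragon appeared and a dragon roared".split()),
    "climax":     build_chain("the hero fought the dragon alone".split()),
    "resolution": build_chain("and peace returned to the town".split()),
}

story = []
for beat in walk(beat_chain, "setup", steps=3):        # Markov step over beats
    first_token = next(iter(token_chains[beat]))       # start each beat's chain
    story.extend(walk(token_chains[beat], first_token, steps=5))
print(" ".join(story))
```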