Comment by srean

4 days ago

In the discussion thread there seems to be a somewhat contentious argument about what a Markov Model is -- whether LLMs are one, whether RNNs are one, and so on.

A Markov Model is anything that has state, emits tokens based only on its current state, and undergoes state transitions. The token emissions and state transitions are usually probabilistic -- a statistical/probabilistic analogue of a state machine. The deterministic state machine is a special case where the transition probabilities are degenerate (concentrated at a unique point).
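To make that concrete, here is a minimal sketch of such a probabilistic state machine. The states, tokens, and probabilities are toy values of my own, not anything from the thread:

```python
import random

# Toy probabilistic state machine: emission and transition depend only on
# the current state. All states, tokens, and probabilities are made up.
emit = {                        # P(token | state)
    "A": {"x": 0.9, "y": 0.1},
    "B": {"y": 1.0},
}
trans = {                       # P(next state | current state)
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 1.0},
}

def draw(dist):
    return random.choices(list(dist), weights=dist.values())[0]

state, out = "A", []
for _ in range(10):
    out.append(draw(emit[state]))   # emit based only on the current state
    state = draw(trans[state])      # then transition
print("".join(out))
```

Collapsing every distribution onto a single outcome recovers the deterministic state machine as the degenerate special case.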

For a Markov Model to be a non-vacuous, non-vapid discussion point, however, one needs to specify very precisely the relationships allowed between state and tokens/observations: whether the state is hidden or visible, discrete or continuous, whether the context length is fixed or variable, whether the model is causal or non-causal ...

The simplest such model is one where the state is a specified, computable function of the last k observations. One such simple function is the identity function -- the state is then just the last k tokens. This is called a k-th order Markov chain and is a restriction of the bigger class -- Markov Models.
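As an illustration (my own toy, not from the comment), a character-level k-th order Markov chain whose state really is just the last k tokens:

```python
from collections import Counter, defaultdict
import random

# k-th order Markov chain over characters: the state is the identity
# function of the last k observations. Corpus and k are arbitrary choices.
k, corpus = 2, "abracadabra abracadabra"

counts = defaultdict(Counter)
for i in range(len(corpus) - k):
    counts[corpus[i:i+k]][corpus[i+k]] += 1   # P(next | last k tokens)

state = corpus[:k]                 # the state IS the last k tokens
out = list(state)
for _ in range(20):
    nxt = random.choices(list(counts[state]),
                         weights=counts[state].values())[0]
    out.append(nxt)
    state = state[1:] + nxt        # state transition: slide the window
print("".join(out))
```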

One can make the state a specified, computable function of the k previous states and the k most recent tokens/observations. (Equivalently, RNNs.)
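Viewed that way, an RNN cell is just a fixed computable state update. A sketch with made-up sizes and random weights (nothing here is from the thread):

```python
import numpy as np

# An RNN cell as a Markov-style update: the next state is a fixed,
# computable function of the previous state and the newest token.
# Dimensions and weights below are arbitrary toy values.
rng = np.random.default_rng(0)
d_state, d_token = 4, 3
W_s = rng.normal(size=(d_state, d_state))   # state-to-state weights
W_x = rng.normal(size=(d_state, d_token))   # token-to-state weights

def step(state, token_vec):
    # The update depends only on (previous state, newest token),
    # which is what makes the recurrence Markov in the state.
    return np.tanh(W_s @ state + W_x @ token_vec)

state = np.zeros(d_state)
for token_vec in rng.normal(size=(5, d_token)):   # a toy token sequence
    state = step(state, token_vec)
print(state)
```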

The functions may be specified only up to a class of computable functions, finite or infinite in size. They may be stochastic, in the sense that they define only the state transition probabilities.

You can make the context length a computable function of the k most recent observations (so contexts can be of varying length), but you have to ensure that the contexts are always full for this model to be well defined.
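A sketch of such a variable-context model. The length rule and the probability table are hypothetical; the point is that every context the rule can select must have a distribution defined for it, which is the "always full" requirement:

```python
import random

# Variable-context Markov model: the context length itself is a computable
# function of the most recent observations. Rule and table are made up.
def ctx_len(history):
    # Toy rule: use a length-1 context right after a space, else length 2.
    return 1 if history and history[-1] == " " else 2

table = {       # P(next token | context), for every reachable context
    " ":  {"a": 1.0},
    " a": {"b": 1.0},
    "ab": {"r": 1.0},
    "br": {"a": 1.0},
    "ra": {"c": 0.5, " ": 0.5},
    "ac": {"a": 1.0},
    "ca": {"d": 1.0},
    "ad": {"a": 1.0},
    "da": {"b": 1.0},
}

out = list("ab")   # seed with enough history so the first context is full
for _ in range(20):
    ctx = "".join(out[-ctx_len(out):])
    dist = table[ctx]
    out.append(random.choices(list(dist), weights=dist.values())[0])
print("".join(out))
```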

Context length can be a computable function of both the ℓ most recent states and the k most recent observations.

Crazier ones emit more than one token based on the current state.

On and on.

Not all Markov Models are learnable.

This is right in terms of the rigorous statistical sense of “Markov model”. But in practice, in the world of NLP and chatbots through the 90s and 2000s, “Markov model” was usually used to refer to Markov chains (i.e. you condition only on the previous k words). Hence the term “Hidden Markov Model” for what you’re calling a Markov model.

  • I covered hidden Markov Models briefly in my comment.

    It depends on whether the state is visible in the observations or not. Hidden vs. visible is an axis of variation orthogonal to the other variations mentioned in the comment.

    In a non-hidden model there is no ambiguity or uncertainty about what the current state is.
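
    A tiny sampler makes the distinction concrete. This is a toy two-state HMM of my own, not anything from the thread: both states can emit the same tokens, so the emitted sequence alone does not pin down the state, which is exactly what makes it hidden.

    ```python
    import random

    # Toy two-state hidden Markov model: we see the emissions, never the
    # states. All probabilities are made up for illustration.
    trans = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
    emit  = {"H": {"x": 0.8, "y": 0.2}, "L": {"x": 0.3, "y": 0.7}}

    def draw(dist):
        return random.choices(list(dist), weights=dist.values())[0]

    state, states, obs = "H", [], []
    for _ in range(10):
        states.append(state)
        obs.append(draw(emit[state]))   # only this line's output is visible
        state = draw(trans[state])

    print("hidden states:", "".join(states))
    print("observations :", "".join(obs))
    ```

    In a non-hidden model the first printed sequence would be recoverable from the second.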