Comment by yobbo

2 years ago

If we let "Markov model" mean a probability distribution over a sequence that factorises as P[X(t) | X(t-1)], then a transformer is specifically not a Markov model. The Markov property means each element is conditioned only on the previous element ("context length" = 1), whereas a transformer conditions each element on the entire preceding context window.
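
For concreteness, a minimal sketch of sampling from such a first-order Markov chain over tokens (the toy vocabulary and transition probabilities below are invented for illustration):

```python
import random

# Hypothetical first-order transition table: the next token depends
# only on the current token, i.e. P[X(t) | X(t-1)].
transitions = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
    "sat": {"the": 1.0},
    "ran": {"the": 1.0},
}

def sample_next(token):
    """Draw X(t) given only X(t-1) -- the Markov property."""
    words, probs = zip(*transitions[token].items())
    return random.choices(words, weights=probs)[0]

token = "the"
sequence = [token]
for _ in range(8):
    token = sample_next(token)
    sequence.append(token)
print(" ".join(sequence))
```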

A discrete, table-based probability distribution over possible next words, as in the link, is a Markov language model, but not a meaningful contestant among Markov models for language modelling. The latest contestants here are things like RWKV, RetNet, SSMs, and so on, which might be better thought of as HMMs.
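
As a sketch of what such a table-based model amounts to: training is just counting bigrams in a corpus and normalising the counts (the toy corpus here is hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the "model" is nothing more than
# normalised bigram counts, P[word | previous word].
corpus = "the cat sat the dog ran the cat ran".split()

counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    counts[prev][curr] += 1

table = {
    prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
    for prev, nxt in counts.items()
}
print(table["the"])  # {'cat': 0.666..., 'dog': 0.333...}
```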

An HMM is a Markov chain over hidden ("latent") states rather than over tokens or words; each hidden state emits an observable token. Variations have been used for speech recognition and language modelling for 20-30 years.
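
A minimal sketch of that idea, with all probabilities invented for illustration: a two-state hidden chain where each state emits tokens from a three-word vocabulary.

```python
import numpy as np

# Hypothetical HMM: the Markov chain runs over hidden states,
# and each state emits an observable token.
rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],        # transitions between 2 hidden states
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],   # emission probs over a 3-token vocab
              [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])        # initial hidden-state distribution

state = rng.choice(2, p=pi)
tokens = []
for _ in range(10):
    tokens.append(rng.choice(3, p=B[state]))  # emit token from state
    state = rng.choice(2, p=A[state])         # hidden Markov step
print(tokens)
```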