Comment by pclmulqdq
2 years ago
That's not actually true; they still have a fixed history window. The idea that transformers capture through the attention mechanism is that not all past tokens are created equal: the importance of a token in that history window depends on what it is.
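A toy sketch of what I mean, just to make it concrete (dimensions and numbers are made up, not from any particular model):

    import numpy as np

    def attention(Q, K, V):
        # scaled dot-product attention: each weight depends on how well a
        # query matches a key, so "importance" is a function of token content
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    # 4 tokens in the window, 8-dim embeddings, random for illustration
    x = np.random.default_rng(0).normal(size=(4, 8))
    out = attention(x, x, x)  # each output row is a content-weighted mix of the window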
They also scale differently: a Markov chain scales exponentially with the size of the window (its transition table needs an entry for every possible window contents), while a transformer scales quadratically (attention compares every pair of tokens in the window). So transformers are exponentially more efficient, even though, without any bound on resources, their capabilities are a strict subset of those of Markov chains.
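Back-of-the-envelope version, assuming a 50k-token vocabulary (an arbitrary figure I picked for illustration):

    import math

    V = 50_000  # assumed vocabulary size
    for n in (2, 8, 1024):
        markov_digits = n * math.log10(V)  # order-n chain: ~|V|**n table entries
        attn_pairs = n * n                 # attention: ~n**2 token-pair scores
        print(f"window={n}: Markov ~10^{markov_digits:.0f} entries, attention ~{attn_pairs} pairs")

Even at a window of 8 tokens the Markov table is astronomically large, while attention is still only 64 pairwise scores.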