Comment by rwmj
2 years ago
Number of parameters is not the difference. A Markov chain can easily be a multi-dimensional matrix with millions of entries. The significant difference is that a length-3 Markov chain can only ever find connections between 3 adjacent symbols (usually words). LLMs seem to be able to find and connect abstract concepts at very long and variable distances in the input.
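To make the fixed-window point concrete, here is a minimal sketch of such a Markov text generator (the corpus, function names, and order are my own illustrative choices, not from the comment). With order=2 the model only ever looks at 3 adjacent words: the 2-word key plus the word that follows it.

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, seed, length=20):
    """Walk the chain: the next word depends ONLY on the previous `order` words."""
    order = len(seed)
    out = list(seed)
    for _ in range(length):
        candidates = chain.get(tuple(out[-order:]))
        if not candidates:
            break
        out.append(random.choice(candidates))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ate the rat".split()
chain = build_chain(corpus, order=2)
print(generate(chain, seed=("the", "cat")))
```

No matter how many entries the table holds, the dependency window never grows: everything further back than the last 2 words is invisible, which is exactly the limitation being contrasted with LLMs.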
Nevertheless, I agree with the premise of the posting. I recently used Markov chains to teach someone what a statistical model of language is, followed by the perceptron, and then (hand-waving a bit) an explanation of how many large, deep layers scale everything up massively.
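The perceptron step in that teaching progression is similarly small. A minimal sketch, assuming the classic error-driven update rule on a toy linearly separable problem (AND); the learning rate and epoch count are arbitrary choices:

```python
# Perceptron trained on AND: weighted sum, hard threshold, error-driven updates.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(20):  # a few passes are plenty for this tiny problem
    for (x1, x2), target in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = target - pred          # -1, 0, or +1
        w[0] += lr * err * x1        # nudge weights toward correct output
        w[1] += lr * err * x2
        b += lr * err

print(w, b)  # the learned weights and bias separate AND correctly
```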
Reply:
Seemingly? Is there not a direct technical reason to compare?
Reply by rwmj:
The people I was instructing are not very technical, so I had to hand-wave a lot. (Nevertheless, I think they got a much better overview of the tech than they would have from reading some pop-sci description.)
Reply:
Reproduction context length is a standard benchmark.