Comment by thfuran
2 years ago
A typical demonstration Markov chain probably has a length of around 3. A typical recent LLM probably has more than three billion parameters. That's not precisely apples to apples, but the LLM is certainly vastly more complicated.
Number of parameters is not the difference. A Markov chain can easily be a multi-dimensional matrix with millions of entries. The significant difference is that a length-3 Markov chain can only ever find connections between 3 adjacent symbols (words, usually). LLMs seem to be able to find and connect abstract concepts at very long and variable distances in the input.
Nevertheless I agree with the premise of the posting. I used Markov chains recently to teach someone what a statistical model of language is, followed by explaining to them the perceptron, and then (hand waving a bit) explaining how many large, deep layers scales everything up massively.
Seemingly? Is there not a direct technical reason to compare?
The people I was instructing are not very technical, I had to hand-wave a lot. (Nevertheless I think they got a much better overview of the tech than they would have got by reading some pop-sci description.)
Reproduction context length is a standard benchmark.
The way you describe it, it doesn't seem much more complicated to me, from a “how does it work” perspective, just way bigger.
The overall structure is the same as in "use statistics to predict the next token."
With a Markov chain, the statistics are as simple as a mapping of n-grams to the number of times it appears in the corpus.
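To make that concrete, here is a minimal sketch of such an n-gram Markov chain (assuming a whitespace-tokenized corpus; the function names and the toy corpus are illustrative, not from any particular library). The whole "model" is a table mapping each n-gram context to counts of the words that follow it:

```python
import random
from collections import Counter, defaultdict

def train(tokens, n=2):
    # The model is just a mapping: n-gram context -> counts of next tokens.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        counts[context][tokens[i + n]] += 1
    return counts

def sample_next(counts, context):
    # Pick the next token proportionally to how often it
    # followed this context in the corpus.
    options = counts[tuple(context)]
    words = list(options)
    weights = [options[w] for w in words]
    return random.choices(words, weights=weights)[0]

corpus = "the cat sat on the mat the cat ate the rat".split()
model = train(corpus, n=2)
# After ("the", "cat"), the corpus contains "sat" once and "ate" once,
# so sampling returns one of those two.
print(sample_next(model, ["the", "cat"]))
```

Note that the context is a fixed-size window of n tokens: anything further back is invisible to the model, which is exactly the limitation mentioned above.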
With an LLM, the statistics are the result of 50 years of research in neural network architectures, terabytes of training data, and many millions of dollars worth of hardware, along with the teams to build and manage all the data pipelines.
So yes, much more complicated.
You can have very complex calculations and a simple output. The complexity of the process that finds the weights is not necessarily the same as the complexity of the process that uses them. The numbers don't suddenly become special because of how they were calculated (42 is just the number 42, even after 7 million years of calculations).
If the minimal representation of a model of the behavior is "way bigger", why are you disputing that it's more complicated? What's the difference?
The difference is the capabilities. LLMs don't necessarily need billions of parameters. In fact, useful models (like the one used in newer Apple devices' autocomplete) have only around 50 million. As for Markov chains... I guess there is probably a reason why we don't use them instead of neural networks. Maybe somebody more knowledgeable can enlighten us, but I suspect one might need orders of magnitude more parameters.