I like your question, and I cannot answer it. But I have a benchmark: I can write a Markov chain "language model" in around 10-20 lines of Python, with zero external libraries -- with tokenization and "training" on a text file, and generating novel output. I wrote it in several minutes and didn't bother to save it.
I'm curious how much time & code it would take to implement this LLM stuff at a similar level of quality and performance.
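For reference, here is a sketch of the kind of throwaway Markov chain the parent describes: whitespace tokenization, "training" by counting successors, and sampling novel output, with zero external libraries. The function names and the crude tokenizer are my own choices; this is a minimal illustration, not anyone's saved code:

```python
# Minimal word-level Markov chain "language model", standard library only.
import random
from collections import defaultdict

def train(path, order=2):
    words = open(path, encoding="utf-8").read().split()  # crude tokenizer
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])        # n-gram context
        chain[key].append(words[i + order])    # observed successor
    return chain

def generate(chain, length=50):
    state = random.choice(list(chain))         # random starting context
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:                      # dead end: stop early
            break
        out.append(random.choice(followers))   # sampling a list element
                                               # weights by frequency
        state = tuple(out[-len(state):])       # slide the context window
    return " ".join(out)
```

Storing repeated successors in a list means `random.choice` samples them in proportion to their corpus frequency, which keeps the code short at the cost of memory.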
FLOPs by perplexity by samples is an interesting way to compare this family of models.
Generally LLM architectures are pretty low-code, I think (I haven't written one myself).
Then all of the complexity comes with the training/weight data.
A typical demonstration Markov chain probably has a length of around 3. A typical recent LLM probably has more than three billion parameters. That's not precisely apples to apples, but the LLM is certainly vastly more complicated.
Number of parameters is not the difference. A Markov chain can easily be a multi-dimensional matrix with millions of entries. The significant difference is that a length 3 Markov chain can only ever find connections between 3 adjacent symbols (words, usually). LLMs seem to be able to find and connect abstract concepts at very long and variable distances in the input.
Nevertheless I agree with the premise of the posting. I used Markov chains recently to teach someone what a statistical model of language is, followed by explaining to them the perceptron, and then (hand-waving a bit) explaining how stacking many large, deep layers scales everything up massively.
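The table-size point can be made concrete with back-of-the-envelope arithmetic (the vocabulary size below is an illustrative assumption, not a measured figure):

```python
# Possible contexts for an order-n word-level Markov chain: vocab_size ** n.
# Even a modest vocabulary gives the table room to dwarf an LLM's parameter
# count -- yet the chain still only ever sees n adjacent words.
vocab_size = 10_000             # a modest English vocabulary (assumed)
order = 3
possible_contexts = vocab_size ** order
print(possible_contexts)        # 1000000000000 -- a trillion potential entries
```

In practice the table is sparse, since only contexts that actually occur in the corpus get entries, but the fixed, short context window remains.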
Seemingly? Is there not a direct technical reason to compare?
The way you describe it, it doesn't seem much more complicated to me, from a “how does it work” perspective, just way bigger.
The overall structure is the same as in "use statistics to predict the next token."
With a Markov chain, the statistics are as simple as a mapping of n-grams to the number of times each appears in the corpus.
With a LLM, the statistics are the result of 50 years of research in neural network architectures, terabytes of training data, and many millions of dollars worth of hardware, along with the teams to build and manage all the data pipelines.
So yes, much more complicated.
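That n-gram-to-count mapping fits in a few lines of standard-library Python (the helper name is my own):

```python
# The "statistics" of a Markov chain, as described above: a plain mapping
# from each n-gram to how often it occurs in the corpus.
from collections import Counter

def ngram_counts(tokens, n=2):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

counts = ngram_counts("the cat sat on the mat".split(), n=2)
# counts[("the", "cat")] == 1, counts[("cat", "sat")] == 1, etc.
```

The LLM's "statistics", by contrast, are billions of learned weights with no such direct, inspectable correspondence to the training corpus.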
If the minimal representation of a model of the behavior is "way bigger", why are you disputing that it's more complicated? What's the difference?