Comment by bongodongobob
2 years ago
I've never seen a Markov chain do anything like GPT4. I'm not sure how you can say with a straight face they are basically the same.
An LLM is a Markov chain with billions of associations and weights. A Markov chain is an LLM with maybe a few dozen associations and weights (so an LM, without the first L).
The difference is in the data structure and the size of the atoms/n-grams. The data structure Markov chain implementations use is not efficient for billions of parameters, either in storage or in processing. But the idea is the same: give a likely next token given the last n tokens. The value of n is a narrow window for Markov chains and an extremely wide window for LLMs. LLMs are able to maintain massive amounts of state compared to a Markov chain implementation.
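For concreteness, a minimal sketch of that idea in Python (toy corpus, not anyone's production implementation): count what follows each window of the last n tokens, then sample the next token from those counts.

    import random
    from collections import Counter, defaultdict

    def build_table(tokens, n=2):
        """Map each window of the last n tokens to counts of the token that follows."""
        table = defaultdict(Counter)
        for i in range(len(tokens) - n):
            history = tuple(tokens[i:i + n])
            table[history][tokens[i + n]] += 1
        return table

    def next_token(table, history):
        """Sample a likely next token given the last n tokens."""
        counts = table.get(tuple(history))
        if not counts:
            return None  # unseen history: a plain Markov chain has nothing to say
        tokens, weights = zip(*counts.items())
        return random.choices(tokens, weights=weights)[0]

    corpus = "the cat sat on the mat and the cat slept on the mat".split()
    table = build_table(corpus, n=2)
    print(next_token(table, ["on", "the"]))  # 'mat'

The "training" is just one pass of counting; the whole trick is how wide you can afford to make the window and how you store the table.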
“A human brain is just like a dog’s brain, only with more neural pathways.” True, perhaps, but largely pointless: at some point neural complexity results in a difference of kind, not of degree.
I’d argue the same is true of LLMs vs simpler models like Markov chains.
But a human brain is also just like an elephant's or a whale's brain, except with 1/5th of the neural pathways at best, probably less.
There's a qualitative difference too.
I think this vastly underplays animal intelligence, though. There is so much focus on creating human-level intelligence, but where is a robot that can learn to navigate the world like a dog or cat can?
1 reply →
The idea behind a complex system can be very simple, but at scale it yields very different results.
Comparing a Markov chain with an LLM is kind of like comparing a single-cell organism to a human being because both are based on cells.
As already mentioned, Markov chains were used in practice as (poor) Turing tests (IMHO working best in art projects and, sadly, spam).
Sure, today's LLMs blow them out of the water, but the difference was much less striking with neural networks even as late as 2010.
2 replies →
It's not just window size. It's the difference between syntax and semantics.
A Markov model, by definition, works only with literal token histories. It can't participate meaningfully in a conversation unless the user happens to employ token sequences that the model has seen before (ideally multiple times). An LLM can explain why it's not just a Markov model, but the converse isn't true.
Now, if you were to add high-dimensional latent-space embedding to a Markov model, that would make the comparison more meaningful, and would allow tractable computation with model sizes that were completely impossible to deal with before. But then it wouldn't be a Markov model anymore. Or, rather, it would still be a Markov model, but one that's based on relationships between tokens rather than just their positions in a linear list.
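A toy sketch of that contrast (the "embeddings" here are just fixed random vectors standing in for a learned latent space, and `seen` is a hypothetical one-entry table): the literal-history lookup returns nothing for a sequence it hasn't seen verbatim, while the similarity-based lookup can still fall back on the nearest stored history.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "dog", "sat", "slept", "on", "mat"]
    vec = {w: rng.normal(size=8) for w in vocab}   # stand-ins for learned token embeddings

    seen = {("the", "cat", "sat"): "on"}           # a tiny hypothetical "trained" table

    def markov_lookup(history):
        # Literal match only: an unseen history yields nothing.
        return seen.get(tuple(history))

    def embed(history):
        v = np.mean([vec[w] for w in history], axis=0)
        return v / np.linalg.norm(v)

    def latent_lookup(history):
        # Pick the stored history whose embedding is closest, reuse its continuation.
        q = embed(history)
        best = max(seen, key=lambda h: float(embed(h) @ q))
        return seen[best]

    print(markov_lookup(["the", "dog", "sat"]))   # None: never seen verbatim
    print(latent_lookup(["the", "dog", "sat"]))   # 'on': matched by similarity

With real learned embeddings the similarity is meaningful rather than random, but the structural difference is the same: lookup by relationship instead of by exact position in a list.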
Another analogy might be to say that a Markov model can implement lossless compression only, while a latent-space model can implement lossy compression. There's a school of thought that says that lossy compression doesn't just require intelligence, it is intelligence, and LLMs can be seen as an example of that equivalence. Not saying I agree with that school, or that you should, but as someone else pointed out, comparing Markov chains with LLMs is at best like comparing goldfish brains with human brains.
I like: intelligence is compressing information into an irreducible representation.
Which leads to a wonderful tongue-in-cheek corollary: when a representation type, such as a particular model, increases in complexity, especially via edge cases, that is a result of agentic anti-intelligence.
That is to say, anything that increases in complexity without being refactored is a sign of a lack of intelligence or worse.
And any representation of information that cannot be reduced further while maintaining equal or greater expressibility is a sign of maximum agentic intelligence.
>high-dimensional latent-space embedding to a Markov model
That's what we call a hidden Markov model.
>There's a school of thought that says that lossy compression doesn't just require intelligence, it is intelligence, and LLMs can be seen as an example of that equivalence.
SVD is used to implement lossy compression, as is JPEG encoding... these algorithms are in no way intelligent.
2 replies →
An LLM is a Markov chain in the same sense that a cat is a tiger, technically true but it misses the qualia.
It's not. There are fundamental architectural differences that couldn't be bigger.
A better comparison would be that it's like a windup toy versus a group of humans moving an entire civilization. They both move along a distance, but just listing the systems that the human group has that the windup toy doesn't is too long to fit on a page.
3 replies →
LLMs are Markov chains in latent space; it's the latent representation that gives them their power, but ultimately there's not as much difference as one would suspect.
They're different because Markov models are stateless whereas LLMs are stateful.
https://en.wikipedia.org/wiki/Markov_property
Current LLMs are stateless as far as we know: their state when computing a new token is only the preceding text tokens; they don't store any metadata or save state from previous calculations.
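Roughly what that statelessness looks like, as a sketch (`next_token` is a hypothetical stand-in for one forward pass of a model): each step is a pure function of the token prefix, with nothing carried over between calls.

    def generate(next_token, prompt_tokens, max_new=20):
        # `next_token` sees only the prefix; no hidden state survives between calls.
        tokens = list(prompt_tokens)
        for _ in range(max_new):
            tok = next_token(tokens)
            if tok is None:
                break
            tokens.append(tok)
        return tokens

    # Trivial demo with a hypothetical stand-in "model":
    print(generate(lambda prefix: "la" if len(prefix) < 5 else None, ["sing:"]))
    # ['sing:', 'la', 'la', 'la', 'la']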
Fair, but maybe it's more of a computer science-cy type of comparison?
We say systems can perform the same types of computations if they're both Turing complete. Yet, we wouldn't implement everything in every "language" that is Turing complete.
Perhaps, every LLM could be represented as a Markov chain, and for some it even makes sense (e.g., easier to train, easier to reason about), but in most cases it's a bad idea (e.g., expensive, bad performance).
No one has spent 100M on training Markov chains.
The trick with Markov chains is that you don't need to.
Markov chains are dead simple. There's not really a "training" phase so much as simply reading data and collecting statistics.
They're so simple that you can probably build one nearly as fast as you can read the training data.
Not 100M I guess, but this is fresh from today's arXiv:
"Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens"
https://arxiv.org/pdf/2401.17377.pdf
For the first few years of smartphones, Markov chains were used for autocorrect.
I think it would take some spectacularly bad engineering to be that wasteful. It would need to be so inefficient that even having ChatGPT write the code wouldn't be bad enough.
It would make no sense; they are not powerful models worthy of the spend.
Google did it. Amazon did it. Plenty of others did. What do you think they were doing before recurrent neural networks?
100 million USD?
No.
I'd believe $40 for the energy cost, $120 for very slightly increased wear on their hard drives, and $400 for one engineer's 20% time project for one week.
And that's if it was trained on Google's entire internet cache rather than, say, just a Wikipedia snapshot from 2004, which sounds like the kind of thing that Google might have set as a pre-interview coding challenge.