Comment by wongarsu
2 years ago
They are an especially useful tool right now, though they might become less valuable as we get better at building LLMs. In principle the inner workings of an LLM could be anything from a Markov-chain-like predictor to a beyond-human intelligence. Token prediction is just the input/output format we chose; you could communicate with a human through the same format and the human would still show human-level intelligence.
What makes Markov chains such a great pedagogic tool right now is that they share (approximately) the same interface as an LLM, a token predictor, and that current LLMs are much closer in capability to a fantastically good Markov chain than to an above-human intelligence.
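For what it's worth, that shared interface is easy to see in code. Below is a minimal bigram Markov chain exposing the same "given the context, predict the next token" interface; the toy corpus and the uniform back-off are made up purely for illustration.

```python
from collections import Counter, defaultdict
import random

# Toy corpus and bigram counts; a real Markov text model would use far more data.
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token(prev):
    """Sample the next token from the empirical distribution P(next | prev)."""
    options = counts[prev]
    if not options:  # terminal/unseen token: back off to a uniform draw (illustrative)
        return random.choice(corpus)
    tokens, freqs = zip(*options.items())
    return random.choices(tokens, weights=freqs, k=1)[0]

token = "the"
for _ in range(5):
    token = next_token(token)
    print(token, end=" ")
```

An LLM presents the same call signature, context in and next-token distribution out, which is exactly why the comparison is pedagogically handy.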
> In principle the inner working of an LLM can be anything from a Markov-chain-like predictor to a beyond-human intelligence.
I'm afraid I have to disagree. Next-token prediction isn't just the interface we use for LLMs; it is fundamentally what they are, to the very core. The training and loss function of the foundation models are completely oriented towards next-token accuracy.
Reasonable people can disagree about emergent behavior and about whether (and how much) the model is "planning ahead" in its weights (and what that could even mean), but it is emphatically not the case that the "next token" model is "just an interface". The analogy to human thought isn't accurate at all: we have our own recursive/iterative thought processes, short- and long-term memory, decision-making loops, etc.
An LLM has no "thought" outside of next-token prediction and no working memory aside from its context window. We don't fully understand all the emergent behavior of a transformer model, but we definitely understand exactly what's happening at the mechanical level: each token is determined, one at a time, by evaluating an extremely complex but deterministic equation in which the model weights are the coefficients and whose output is a probability distribution over the next token.
There's no hidden intelligence or man behind the curtain. Whatever an LLM can do, next-token prediction is how it does it.
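For concreteness, here is what that mechanical level looks like as a decoding loop, sketched with the Hugging Face transformers library and the public GPT-2 checkpoint (the prompt, length, and sampling choices are arbitrary): one forward pass per token, a probability distribution out, and the growing context window as the only state.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits                    # one forward pass over the context
        probs = torch.softmax(logits[0, -1], dim=-1)        # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token from it
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)  # append to the context

print(tokenizer.decode(input_ids[0]))
```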
> The training and loss function of the foundation models are completely oriented towards next-token accuracy.
This doesn't mean anything.
The loss function and training only concern themselves with the result of the prediction. Training doesn't care about the in-between, the computation itself, except as a means to that end.
It's not Input A > Output B. It's Input A > Computation > Output B.
That Computation could quite literally be anything. And no, we do not automatically know what it represents, or whether it represents anything we would even understand.
If you meticulously train a Transformer to predict the token that is the result of an addition, you might hope the computation it learns is some algorithm for addition, but you wouldn't actually know until you probed the model and succeeded.
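As a concrete (hypothetical) illustration of that framing: posed as next-token prediction, the addition task gives the training loop nothing but shifted token sequences to score, so whatever internal algorithm the model learns has to be discovered afterwards by probing. The character-level token scheme below is made up for the example.

```python
import random

# Character-level vocabulary for strings like "12+34=46" (illustrative).
VOCAB = {ch: i for i, ch in enumerate("0123456789+=")}

def make_example():
    a, b = random.randint(0, 99), random.randint(0, 99)
    ids = [VOCAB[ch] for ch in f"{a}+{b}={a + b}"]
    # Standard next-token setup: inputs are all tokens but the last,
    # targets are the same sequence shifted by one. The loss only ever
    # compares predicted tokens to these targets; it never sees "how"
    # the model computed them.
    return ids[:-1], ids[1:]

inputs, targets = make_example()
print(inputs, targets)
```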
But in an LLM it is not an arbitrary computation. Very specifically, it is a single forward pass through a neural network.
Neural networks are very general function approximators, so yes, there is some room for emergent behavior. But it could _not_ be "quite literally anything": it's plugging values into a single (very big) equation.
I think we do ourselves a disservice by pretending it's more of a black box than it is.
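To make the "one big equation" point concrete, here is a toy numpy sketch of a single forward pass: fixed coefficients in, a probability distribution over the vocabulary out. The sizes are tiny, the weights are random, and causal masking, layer norm, the MLP block, and everything else a real transformer has are omitted; it only shows the shape of the computation, not an actual LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, seq = 16, 50, 5

W_e = rng.normal(size=(vocab, d_model))            # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_out = rng.normal(size=(d_model, vocab))          # unembedding matrix

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = rng.integers(0, vocab, size=seq)          # some context tokens
x = W_e[tokens]                                    # (seq, d_model)

# one self-attention layer; the whole thing is matrix multiplies and softmaxes
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = softmax(q @ k.T / np.sqrt(d_model), axis=-1)
x = scores @ v

next_token_probs = softmax(x[-1] @ W_out)          # distribution over the vocabulary
print(next_token_probs.sum())                      # 1.0: just one big deterministic equation
```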