Comment by gpm
10 hours ago
I'd disagree, the other training on top doesn't alter the fundamental nature of the model that it's predicting the probabilities of the next token (and then there's a sampling step which can roughly be described as picking the most probable one).
It just changes the probability distribution that it is approximating.
To the extent that thinking is making a series of deductions from prior facts, it seems to me that thinking can be reduced to "pick the next most probable token from the correct probability distribution"...
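To make the "probabilities, then a sampling step" description concrete, here is a minimal sketch. It assumes the model hands back a list of raw scores (logits), one per token; the names `softmax` and `sample_next_token` are made up for illustration and are not any particular library's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick a token index: plain argmax at temperature 0, otherwise a weighted random draw."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Further training changes which logits come out for a given input; the sampling step stays the same, which is the sense in which only the distribution being approximated changes.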
The fundamental nature of the model is that it consumes tokens as input and produces token probabilities as output, but there's nothing inherently "predictive" about it -- that's just perspective hangover from the historical development of how LLMs were trained. It is, fundamentally, I think, a general-purpose thinking machine, operating over the inputs and outputs of tokens.
(With this perspective, I can feel my own brain subtly offering up a panoply of possible responses in a similar way. I can even turn up the temperature on my own brain, making it more likely to decide to say the less-obvious words in response, by having a drink or two.)
(Similarly, mimicry is a very good learning technique for humans getting started, too -- kids learning to speak are little parrots, artists just starting out will often copy existing works, etc., before going on to develop their own style.)
Put a loop around an LLM and it can trivially be made Turing complete, so it boils down to whether thinking requires exceeding the Turing computable, and we have no evidence to suggest that is even possible.
What are you doing in your loop?
As typically deployed [1], LLMs are not Turing complete. They're closer to linear bounded automata, but because transformers have a strict maximum input size they're actually a subset of the weaker class of deterministic [2] finite automata. These aren't like Python programs or something that can work on as much memory as you supply them; their architecture works on a fixed maximum amount of memory.
I'm not particularly convinced Turing complete is the relevant property though. I'm rather convinced that I'm not Turing complete either... my head is only so big after all.
[1] i.e. in a loop that appends output tokens to the input and has some form of sliding context window (perhaps with some inserted instructions to "compact" and then sliding the context window right to after those instructions once the LLM emits some special "done compacting" tokens).
[2] Common sampling procedures make them mildly non-deterministic, but I don't believe they do so in a way that changes the theoretical class of these machines from DFAs.
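A rough sketch of the deployment in [1], to show why it is finite-state: the only thing carried between steps is the last `window` tokens, so the set of reachable states is bounded. Everything here is assumed for illustration: `toy_model` is a deterministic stand-in for the LLM over a 3-token vocabulary, and there is no compaction step.

```python
from collections import deque

def toy_model(state):
    """Hypothetical deterministic 'LLM': maps the visible context to one of 3 tokens."""
    return (sum(state) + 1) % 3

def run_sliding_window(next_token, prompt, window, steps):
    """Append each output token to the context, dropping the oldest once the window is full."""
    context = deque(prompt, maxlen=window)
    seen = set()   # every distinct context the model was ever shown
    out = []
    for _ in range(steps):
        state = tuple(context)   # the entire machine state at this step
        seen.add(state)
        tok = next_token(state)
        out.append(tok)
        context.append(tok)      # deque with maxlen slides the window for us
    return out, seen
```

With a vocabulary of size V and a window of n tokens, `seen` can never hold more than V^0 + V^1 + ... + V^n entries, no matter how long the loop runs.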
Context effectively provides an IO port, so all the loop needs to do is simulate the tape head and provide a single token of state.
You can be unconvinced that Turing completeness is relevant all you want - we don't know of any more expansive category of computable functions, and so, given that an LLM in the setup described is Turing complete, the fact that they aren't typically deployed that way is irrelevant.
They trivially can be, and that is enough to make the shallow dismissal of pointing out they're "just" predicting the next token meaningless.
A Turing machine doesn't need access to the entire tape all at once; it's sufficient for it to see one cell at a time. You could certainly equip an LLM with "read cell", "write cell", and "move left/right" tools, and now you have a Turing machine. It doesn't need to keep any of its previous writes or reads in context. A sliding context window is more than capacious enough for this.
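A minimal sketch of that tool loop, under stated assumptions: `controller` is a hypothetical stand-in for the LLM call (here a tiny rule table that increments a binary number), and the tape lives entirely outside it. The controller only ever sees one state token and one cell.

```python
from collections import defaultdict

BLANK = "_"

def controller(state, symbol):
    """Stand-in for the LLM: given (state, current cell), return (write, move, new state)."""
    rules = {
        ("carry", "1"): ("0", -1, "carry"),    # 1 + carry = 0, keep carrying left
        ("carry", "0"): ("1", 0, "halt"),      # 0 + carry = 1, done
        ("carry", BLANK): ("1", 0, "halt"),    # ran off the left edge, add a new digit
    }
    return rules[(state, symbol)]

def run(tape_str, max_steps=10_000):
    """Drive the controller with read/write/move tools over an unbounded external tape."""
    tape = defaultdict(lambda: BLANK, enumerate(tape_str))
    head = len(tape_str) - 1   # start at the least significant bit
    state = "carry"
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape[head]                        # "read cell" tool
        write, move, state = controller(state, symbol)
        tape[head] = write                         # "write cell" tool
        head += move                               # "move left/right" tool
    lo, hi = min(tape), max(tape)
    return "".join(tape[i] for i in range(lo, hi + 1)).strip(BLANK)

print(run("1011"))  # 1100
```

The `max_steps` cap is only a safety net for the sketch; the tape itself is a dictionary that grows on demand, which is where the unbounded memory comes from.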
No physically realizable machine is technically Turing complete.
But it is trivially possible to give systems-including-LLMs external storage that is accessible on demand.
> whether thinking requires exceeding the Turing computable
I've never seen any evidence that thinking requires such a thing.
And honestly I think theoretical computational classes are irrelevant to analysing what AI can or cannot do. Physical computers are only equivalent to finite state machines (ignoring the internet).
But the truth is that if something is equivalent to a finite state machine, with an absurd number of states, it doesn't really matter.
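For a sense of scale, a back-of-the-envelope sketch: treat each distinct full context window as one state and count them. The vocabulary and window sizes below are illustrative assumptions, not figures for any specific model.

```python
import math

def log10_context_states(vocab_size, window):
    """log10 of vocab_size ** window, the number of distinct full context windows."""
    return window * math.log10(vocab_size)

# Hypothetical sizes: a 50,000-token vocabulary and an 8,192-token window.
digits = log10_context_states(50_000, 8192)
print(f"roughly 10^{digits:.0f} states")
```

The exponent alone runs to tens of thousands, so "finite" here puts no practical limit on what the machine can be made to do.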