Comment by wizzwizz4
3 months ago
If a process is necessary for performing a task, (sufficiently-large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it with anything resembling efficiency, or that a different architecture / algorithm wouldn't produce a better result.
I'm not arguing about efficiency, though? I'm simply saying next-token predictors cannot be thought of as actually just thinking about the next token with no long-term plan.
They rebuild the "long term plan" anew for every token: there's no guarantee that the reconstructed plan will remain similar between tokens. That's not how planning normally works. (You can find something like this every time there's this kind of gross inefficiency, which is why I gave the general principle.)
> They rebuild the "long term plan" anew for every token
Well no, there is attention in the LLM which allows it to look back at its "internal thought" during the previous tokens.
Token T at layer L can attend to a projection of the hidden states of all tokens < T at L. So it's definitely not starting anew at every token, and it is able to iterate on an existing plan.
It's not a perfect mechanism for sure, and there is work on making LLMs carry more information forward (e.g. feedback transformers), but they can definitely do some of that today.
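For what it's worth, here's a minimal single-head sketch in NumPy (toy random weights, not any real model's code) of what "token T attending to a projection of earlier hidden states" looks like: the output at position T is a weighted mix of the value projections of positions <= T.

```python
# Minimal sketch of single-head causal self-attention (toy weights, assumption:
# one head, no bias/normalization). Row T of the output mixes value projections
# of positions 0..T only, i.e. later tokens read projections of earlier states.
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) hidden states entering some layer L."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project hidden states
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (seq_len, seq_len)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf                       # position T sees only <= T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # row T mixes earlier values

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                      # 5 toy token states
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)                                 # (5, 8)
```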
Actually, due to using causal (masked) attention, new tokens appended to the input don't have any effect on what's calculated internally (the "plan") at earlier positions in the input, and a modern LLM therefore uses a KV cache rather than recalculating at those earlier positions.
In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.
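A tiny self-contained check of that claim (toy NumPy attention with Q = K = V = x, not a real LLM): append a token, recompute everything, and the rows for the earlier positions come out identical, which is exactly why caching K/V is safe.

```python
# Toy check of the KV-cache point: with a causal mask, appending a new token
# leaves the attention outputs at all earlier positions unchanged, so an
# implementation can cache them instead of recomputing.
import numpy as np

def causal_attention(x):
    """Toy single-head attention with Q = K = V = x and a causal mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))                 # 6 toy token states, 4-dim
out_short = causal_attention(seq[:5])         # before appending token 6
out_long = causal_attention(seq)              # after appending token 6
print(np.allclose(out_short, out_long[:5]))   # True: earlier rows unchanged
```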
Right, and this is what "reasoning LLMs" work around by having explicitly labelled "reasoning tokens".
This lets them "save" the plan between tokens, so when generating each new token the model is following the plan.
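A purely conceptual sketch of that (the fake_next_token() function below is a hypothetical stand-in for one sampling step, not any real API): whatever "reasoning" the model emits is appended to the context, so every later step receives the written-out plan as input rather than having to reconstruct it internally.

```python
# Conceptual sketch only: fake_next_token() is a hypothetical stand-in for a
# single LLM sampling step. The structural point is that emitted reasoning
# tokens accumulate in the context and are re-read on every subsequent step.
CANNED = ["plan:", " compute", " 17*20,", " add", " 17*3,", " answer 391.", "<end>"]

def fake_next_token(context: str, step: int) -> str:
    """Pretend model call; a real LLM would condition on `context` here."""
    return CANNED[step]

def generate(prompt: str) -> str:
    context = prompt + "\n"
    for step in range(len(CANNED)):
        tok = fake_next_token(context, step)   # the saved plan is in `context`
        if tok == "<end>":
            break
        context += tok                         # plan persists for later steps
    return context

print(generate("Q: what is 17 * 23?"))
```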
It also doesn’t mean they’re doing it inefficiently.
I read this to mean “just because the process doesn’t match the problem, that doesn’t mean it’s inefficient”. But I think it does mean that. I expect we intuitively know that data structures which match the structure of a problem are more efficient than those that don’t. I think the same thing applies here.
I realize my argument is hand-wavy, I haven't defined "efficient" (in space? time? energy?), and there are other shortcomings, but I feel this is "good enough" to be convincing.
Example: a list of (key, value) pairs is a perfectly valid way to implement a map, and it suffices. However, a more complicated tree structure, perhaps with hashed keys, is usually far more efficient, and the difference becomes increasingly noticeable as the number of pairs stored in the map grows large.
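To make that concrete, here's a rough, machine-dependent comparison (hypothetical sizes, standard library only): the same lookups against a flat list of pairs and against a hash-based dict.

```python
# Rough illustration of the data-structure point above: both containers expose
# the same "map" behaviour, but the flat list does an O(n) scan per lookup
# while the hash-based dict stays roughly O(1). Timings are machine-dependent.
import time

def assoc_lookup(pairs, key):
    for k, v in pairs:          # linear scan over every stored pair
        if k == key:
            return v
    raise KeyError(key)

n = 100_000
pairs = [(i, i * i) for i in range(n)]
table = dict(pairs)             # same contents, hashed keys

start = time.perf_counter()
for key in range(0, n, 1000):
    assoc_lookup(pairs, key)
list_time = time.perf_counter() - start

start = time.perf_counter()
for key in range(0, n, 1000):
    table[key]
hash_time = time.perf_counter() - start

print(f"list of pairs: {list_time:.4f}s, dict: {hash_time:.6f}s")
```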
I suppose there's something in what you're saying; it's just that it's sorta vague and hard to parse for me. It also depends on the higher-order problem space. For example: is it efficient if the problem is defined as "make something that can adapt to a problem space and solve it without manual engineering" rather than "make something with a long lead-up time where you understand the problem space in advance and therefore have time to optimise the engine"? In the former case, the neural network would indeed count as solving the problem efficiently, because of the given definition of the goal.