Comment by sailingparrot
3 months ago
I don't follow how this relates to what we are discussing. Autoregressive LLMs are able to plan within a single forward pass and can look back at their previous reasoning; they do not start anew at each token, as you claimed.
If you append tokens from another source, as in a turn-based conversation, the LLM will process all the newly appended tokens in parallel while still being able to look back at its previous internal state (and thus past reasoning/planning in latent space) from the already-processed tokens, then adjust the plan based on the new information.
What happens to you as a human if you come up with a plan with limited information and new information is provided to you?
Not the original person you are replying to, but I wanted to add:
Yes, they can plan within a single forward pass like you said, but I still think they "start anew at each token" because they have no state/memory that is not the output.
I guess this comes down to differing interpretations of "start anew", but personally I would say that having no internal state and simply looking back at its previous output to form a new token is "starting anew".
But I'm also not well informed about the topic so happy to be corrected.
But you are missing the causal attention from your analysis. The output is not the only thing that is preserved, there is also the KV-cache.
At token 1, the model goes through, say, 28 transformer blocks, and for each of those blocks we save 2 projections of the hidden state in a cache.
At token 2, on top of seeing the new token, the model is now also able, in each of those 28 blocks, to look at the previously saved hidden states from token 1.
At token 3, it can see the states from tokens 2 and 1, and so on.
However, I still agree that this is not a perfect information-passing mechanism because of how those models are trained (something like the Feedback Transformer would be better), but information very much is being passed from earlier tokens to later ones.
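The per-block mechanism described above can be sketched in a few lines. This is a minimal single-head, single-block illustration in NumPy, not any real model's implementation; the dimensions, projection matrices, and `decode_step` helper are all hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative, not a real model's)
# Hypothetical Q/K/V projection matrices for one attention block
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the KV-cache: two saved projections per token

def decode_step(h):
    """Process one token's hidden state h, attending over all cached tokens."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    k_cache.append(k)               # save this token's projections...
    v_cache.append(v)
    K = np.stack(k_cache)           # (t, d): keys for tokens 1..t
    V = np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V              # ...so later tokens can read earlier states

for t in range(3):
    h = rng.standard_normal(d)      # stand-in hidden state for token t+1
    out = decode_step(h)

# After 3 steps the cache holds 3 keys: token 3 attended over tokens 1 and 2.
```

In a real model this happens in every one of the (say) 28 blocks, so the cached states are intermediate latents, not just the emitted tokens.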
Like another commenter said, isn't the KV-cache a performance optimization to avoid redoing work that was already done? Or does it fundamentally alter the output of the LLM, and so preserve state that is not present in the output of the LLM?
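This question can be answered empirically: for standard causal attention, decoding incrementally with a KV-cache produces exactly the same outputs as recomputing attention over the whole sequence. The sketch below (a toy single-head layer in NumPy, with made-up dimensions and weights) checks the two against each other:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
T, d = 5, 4
X = rng.standard_normal((T, d))  # stand-in hidden states for T tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Full recompute: causal attention over the entire sequence at once
Q, K, V = X @ Wq, X @ Wk, X @ Wv
mask = np.triu(np.full((T, T), -np.inf), k=1)  # block attention to the future
full = softmax(Q @ K.T / np.sqrt(d) + mask) @ V

# Incremental: one token at a time, reusing cached K/V from earlier tokens
ks, vs, out = [], [], []
for t in range(T):
    q, k, v = X[t] @ Wq, X[t] @ Wk, X[t] @ Wv
    ks.append(k); vs.append(v)
    Kc, Vc = np.stack(ks), np.stack(vs)
    out.append(softmax(q @ Kc.T / np.sqrt(d)) @ Vc)
incremental = np.stack(out)

print(np.allclose(full, incremental))  # → True
```

So the cache changes the cost of decoding, not its result: it holds intermediate latent states, but the same states would be recomputed identically from the token sequence without it.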
Worth noting here for others following that a single forward pass is what generates a single token.
It's correct to state that the LLM starts anew for each token.
The workaround for this is to pass the existing plan back into it as part of the context.
You are forgetting about attention over the KV-cache, which is the mechanism that allows LLMs not to start anew every time.