Comment by HarHarVeryFunny

3 months ago

Actually, because of causal (masked) attention, new tokens appended to the input have no effect on what is calculated internally (the "plan") at earlier positions, which is why a modern LLM uses a KV cache rather than recomputing those earlier positions.

In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.

You can violate the plan in the sampler by making an "unreasonable" choice of next token (e.g. by raising the temperature). So if the model does stick to the same plan after that, it's not going to be a very good one.
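
If it helps, here's a minimal sketch of what raising the temperature does in the sampler (illustrative numpy, not any specific implementation):

    import numpy as np

    def sample(logits, temperature=1.0, rng=np.random.default_rng()):
        # dividing the logits by a temperature > 1 flattens the distribution,
        # so low-probability ("plan-violating") tokens get picked more often
        z = logits / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        return rng.choice(len(p), p=p)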

  • Yeah.

    Karpathy recently referred to LLMs having more "working memory" than a human, apparently referring to these unchanging internal activations as "memory", but it's an odd sort of "working memory" if you can't actually update it to reflect progress on what you're working on, or update it in light of new information (an unexpected new token having just been sampled).

    • I think a better mental framework for how these models work is that they keep a history of the states of their "memory" across time.

      Where humans have a single evolving memory state, LLMs have access to all the states of their "memory" across time, and while past states can't be changed, the new state can be: this is the current token's hidden state, and to form this new state the model looks both at the history of previous states and at the new information (the last token having been sampled, or external tokens from RAG or whatnot appended to the context).

      This is how progress is stored.
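
      A toy way to picture this (illustrative numpy, a single attention step; real models stack many layers and heads): the list of past states is append-only, and the new state is formed by reading over that whole history plus the new input.

          import numpy as np

          def step(states, x_new, W_q, W_k, W_v):
              # 'states' holds one vector per previous position; past entries
              # are read but never rewritten
              states.append(x_new)                    # history only grows
              H = np.vstack(states)
              q = x_new @ W_q                         # query for the newest position
              scores = q @ (H @ W_k).T / np.sqrt(W_q.shape[-1])
              w = np.exp(scores - scores.max())
              w /= w.sum()
              return w @ (H @ W_v)                    # new hidden state: history + new info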
