Comment by sailingparrot

3 months ago

You are forgetting about attention over the KV cache, which is the mechanism that allows an LLM to not start anew every time.

I mean sure, but that is a performance optimization that doesn't really change what is going on.

It's still recalculating; it's just that the intermediate steps are cached.
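
To make the "intermediate steps are cached" point concrete, here is a rough numpy sketch of single-head causal attention (toy sizes and made-up weights, so purely an illustration, not anyone's actual code): decoding with a KV cache gives exactly the same per-token attention output as recomputing over the whole prefix at every step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16                                      # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((10, d))       # stand-in hidden states for 10 tokens

def attend(q, K, V):
    """Attention output for one query q over keys K and values V."""
    return softmax(K @ q / np.sqrt(d)) @ V

# 1) No cache: at every step, re-project and re-attend over the whole prefix.
no_cache_out = []
for t in range(len(tokens)):
    prefix = tokens[: t + 1]
    q = Wq @ tokens[t]
    K, V = prefix @ Wk.T, prefix @ Wv.T
    no_cache_out.append(attend(q, K, V))

# 2) KV cache: project each token once, append to the cache, attend over it.
K_cache, V_cache, cached_out = [], [], []
for t in range(len(tokens)):
    K_cache.append(Wk @ tokens[t])
    V_cache.append(Wv @ tokens[t])
    q = Wq @ tokens[t]
    cached_out.append(attend(q, np.stack(K_cache), np.stack(V_cache)))

print(np.allclose(no_cache_out, cached_out))   # True: same function, less compute
```

So the cache changes the cost of the computation, not the function being computed.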

  • Isn't the ability to store past reasoning in an external system, to avoid having to do the computation all over again, precisely what a memory is though?

    Sure, mathematically KV-caching is equivalent to doing prefilling at every token. But the important part of my message was the attention.

    A plan or piece of reasoning formed during the forward pass of token 0 can be looked at by the subsequent (or parallel, if you don't want to use the cache) passes of tokens 1, …, n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning, since it can reuse what has already been planned during previous tokens.

    If you think about inference with KV-caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the KV cache is a store of past internal states that the model can attend to when computing subsequent tokens, which allows a subsequent token's hidden states to be more than just a repetition of what the model already reasoned about in the past. A rough sketch of that view follows below.
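
    To sketch what I mean by "store of past internal states" (again a toy numpy example with a single head and made-up cached tensors rather than a real model): the newest token's output is computed by attending over the K/V entries written at earlier steps, so editing an earlier cache entry changes what the new token computes, even though the new token's own input is untouched.

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(1)
    d, n = 16, 8
    K_cache = rng.standard_normal((n, d))   # keys cached by earlier forward passes
    V_cache = rng.standard_normal((n, d))   # values cached by earlier forward passes
    q_new = rng.standard_normal(d)          # query of the newest token

    def attend(q, K, V):
        return softmax(K @ q / np.sqrt(d)) @ V

    out = attend(q_new, K_cache, V_cache)

    # Overwrite the state cached at step 0; the newest token's output shifts,
    # i.e. it really was reusing that earlier internal state via attention.
    V_edited = V_cache.copy()
    V_edited[0] = 0.0
    out_edited = attend(q_new, K_cache, V_edited)

    print(np.allclose(out, out_edited))     # False: step n depends on step 0's cache
    ```

    In that sense the later token is not starting from scratch: part of its computation is a read of what earlier passes already worked out.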