Comment by nl
13 hours ago
> upload and restore it when the user starts their next interaction
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache, and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
You're quite confidently wrong! :-)
The KV cache is the internal LLM state after the model has processed the tokens. It's big, and you do not have it locally.
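To see why "big" is no exaggeration, here's a back-of-the-envelope sketch. All the model parameters below are illustrative assumptions (a Llama-70B-like architecture with grouped-query attention), not figures for any specific deployment:

```python
# Rough KV-cache size estimate. Every number here is an assumed,
# illustrative value for a Llama-70B-class model, not a measurement.
n_layers = 80        # transformer layers
n_kv_heads = 8       # KV heads (grouped-query attention)
head_dim = 128       # dimension per attention head
bytes_per_elem = 2   # fp16/bf16 storage

# Both keys and values are cached per layer, hence the factor of 2.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

context_len = 100_000  # tokens in a long conversation
cache_bytes = bytes_per_token * context_len
print(f"{bytes_per_token} bytes/token, {cache_bytes / 2**30:.1f} GiB total")
```

Under those assumptions it works out to roughly 320 KB per token, or around 30 GiB of GPU memory for a 100k-token conversation, which is why the cache gets evicted rather than kept around between sessions.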