
Comment by rao-v

2 days ago

The "directly into working memory" bit is nonsense, of course, but it does point to a problem that is probably worth solving.

What would it take to make the KV cache more portable and cut/paste-able, rather than highly specific to the query?

In theory, today I should be able to process <long quote from document> <specific query>, just stop after the long document, and save the KV cache, right? The next time around, I can just load it back in and continue from <new query>?
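Something like this should already work for the single-document case with the plain transformers API (rough sketch, untested; exact cache types vary by version, and serializing the cache to disk is up to you):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # any causal LM works
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # 1) Prefill the long document once and keep the KV cache.
    doc_ids = tok("<long quote from document>", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(doc_ids, use_cache=True)
    doc_cache = out.past_key_values       # this is the bit you'd persist

    # 2) Later, feed only the new query on top of the cached prefix.
    query_ids = tok(" <new query>", return_tensors="pt").input_ids
    with torch.no_grad():
        out2 = model(query_ids, past_key_values=doc_cache, use_cache=True)
    # out2.logits only needed a forward pass over the query tokens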

To keep going, you should be able to train the model so that it can operate on discontinuous, unrelated KV cache segments, so you could drop in <cached KV from doc 1> <cached KV from doc 2> with <query related to both> and have it just work ... but I don't think you can do that today.

I seem to remember seeing some papers that tried to "unRoPE" the KV cache and then "re-RoPE" it so it can be reused ... but I have not seen the latest. Does anybody know what the current state is?
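Conceptually the position shift should be cheap, since RoPE rotations compose: applying a rotation of delta positions to keys already rotated at position m gives keys rotated at m + delta. A toy sketch of what I mean (helper names made up, LLaMA-style rotate-half convention, not from any specific paper):

    import torch

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([-x2, x1], dim=-1)

    def shift_cached_keys(k_cache, delta, base=10000.0):
        # k_cache: (..., seq_len, head_dim), keys already RoPE'd at positions 0..seq_len-1
        head_dim = k_cache.shape[-1]
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        ang = delta * torch.cat([inv_freq, inv_freq])   # same extra angle for every cached token
        cos, sin = ang.cos(), ang.sin()
        # Rotations compose: R(m) then R(delta) == R(m + delta)
        return k_cache * cos + rotate_half(k_cache) * sin

    # e.g. take a cache built for doc 2 at positions 0..N-1 and slot it in
    # after doc 1 (length L): k2_shifted = shift_cached_keys(k2_cache, delta=L)

(That only fixes the positions, of course; each document's keys were still computed without ever attending to the other document, which is presumably the part that needs training.)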

Seems crazy to have to re-process the same context multiple times just to ask it a new query.

Do you have any links to the papers for the "unRoPE" and "re-RoPE" technique? I tried some searching and couldn't find anything. I would love to look into this idea more.

I think that copy/paste-able KV cache idea sounds pretty promising. It might lose some of the inter-document context and attention that would get built up in the hidden state of the model as it processes the prompt. Maybe just throw in some 'reasoning' tokens before it gives its answer, to give it a chance to attend across the documents.

Would loading the KV cache from disk actually be faster than just recomputing it?

IMO the discontinuous-segments bit would not work out of the box because of the causal dependence in transformers plus RoPE, as you mention, but maybe it could be made to work.

> In theory, today I should be able to process <long quote from document> <specific query>, just stop after the long document, and save the KV cache, right?

People already do this; it's called prefix caching.
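vLLM, for example, ships this as automatic prefix caching: requests that share a prefix reuse the same KV blocks instead of re-prefilling. Rough example (flag name from memory, check the docs):

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
    params = SamplingParams(max_tokens=256)

    doc = open("long_document.txt").read()
    # The second call hits the cached KV blocks for `doc` instead of re-prefilling it.
    out1 = llm.generate([doc + "\n\nQ: What is the main argument?"], params)
    out2 = llm.generate([doc + "\n\nQ: Who is the intended audience?"], params)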

There's also https://arxiv.org/abs/2506.06266, where they compress the context down to a smaller representation they call a "cartridge," and composing cartridges from different contexts seems to work reasonably well.