Comment by porridgeraisin

5 hours ago

I wanted this as well. Even asked about it at an OpenAI talk. Basically a way to get the KV cache to the client (they can encrypt it if they care about me REing it, make a compressed latent if they don't wanna egress 20GB, whatever, I'm fine with a black box) so that I can load it later and avoid these cache misses.
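For a rough sense of why egress size comes up at all, here is a back-of-the-envelope sketch of how big an uncompressed KV cache gets. It assumes plain multi-head or grouped-query attention, and the model shape is made up for illustration, not any real deployment's. Other attention variants or compression would change the number.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Factor of 2 covers keys and values, stored per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical model shape (an assumption, not a real model's config).
size = kv_cache_bytes(
    tokens=1_000_000,   # a 1M-token context
    layers=32,
    kv_heads=8,
    head_dim=128,
    bytes_per_elem=1,   # e.g. an 8-bit KV cache
)
print(f"{size / 1e9:.1f} GB")  # prints "65.5 GB"
```

So even with a small-ish hypothetical shape you're in tens-of-GB territory at 1M tokens, which is why a compressed latent would be attractive.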

I think the primary reason they cannot do this is that they change the memory and communication layouts in their serving stack rather aggressively. And naturally keeping the KV cache portable across all such layouts is a very difficult task. So you'd have to version the cache down to a specific deployment, and invalidate it the moment anything even small changes. So giving the user a handle to the cache sort of prevents you from making large changes to memory layout. Which is I suppose not that enticing. Also, client-side KV caches are only meaningful with today's 1M-token contexts. A few years back it wasn't necessary, since just recomputing would be better for everybody.
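To make "version the cache down to a specific deployment" concrete, here is a minimal sketch of what such a handle could look like. Everything in it is hypothetical (`CacheHandle`, `layout_fingerprint`, and the deployment fields are names I made up); it is not how any provider actually does it.

```python
import hashlib
import json
from dataclasses import dataclass


def layout_fingerprint(deployment: dict) -> str:
    """Hash every setting that affects how the cache bytes are laid out."""
    canonical = json.dumps(deployment, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


@dataclass(frozen=True)
class CacheHandle:
    blob_id: str       # opaque pointer to the stored (possibly encrypted) cache
    fingerprint: str   # fingerprint of the deployment that wrote the cache


def can_reuse(handle: CacheHandle, current_deployment: dict) -> bool:
    # Any change, however small, yields a different hash, so the
    # exported cache is treated as a miss and the prompt is recomputed.
    return handle.fingerprint == layout_fingerprint(current_deployment)


# Hypothetical deployment descriptions (illustrative fields only).
old = {"model": "m-1", "tensor_parallel": 8, "kv_dtype": "fp8", "page_size": 16}
new = {**old, "page_size": 32}

handle = CacheHandle("blob-123", layout_fingerprint(old))
print(can_reuse(handle, old))  # True
print(can_reuse(handle, new))  # False
```

The catch is visible right in the example: a one-field change to the serving config silently turns every handle already given out into a miss.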

To be clear, I don't mean they send it along with every request. Rather, they do their current TTL cache, and then when I'm at the end of a session, I request it in one shot and then close the session. And it doesn't have to come to the literal client, they can egress it to a storage service that we pay for, whatever. But ya the compat problem makes it all a non-starter.