Comment by zozbot234
6 days ago
The storage requirements for large-model KV caches are actually comparatively tiny: per-token cache size grows far more slowly than parameter count as models scale. Of course, "tiny" here means small enough to stash on bulk storage and slowly fetch back into RAM. But that should still be viable for very long contexts, since recomputing prefill takes time quadratic in the context length.
We only have open models to go by, but looking at GLM 5.1 for instance, we're talking almost 300 GB of KV cache for a full 200k-token context window.
That's hardly tiny.
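
For reference, the standard back-of-the-envelope formula is 2 × layers × KV heads × head dim × bytes per element, per token. A minimal sketch below; the hyperparameters are illustrative placeholders, not GLM's published config, though they happen to land near the ~300 GB figure above:

```python
# Back-of-the-envelope KV-cache sizing.
# All hyperparameters below are assumed for illustration only.

def kv_cache_bytes(tokens: int,
                   num_layers: int = 92,      # assumed
                   num_kv_heads: int = 32,    # assumed (little/no GQA)
                   head_dim: int = 128,       # assumed
                   bytes_per_elem: int = 2):  # fp16/bf16
    # 2x for the K and V tensors, per layer, per KV head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return tokens * per_token

print(kv_cache_bytes(200_000) / 1e9)  # ~301 GB at these settings
```

A model using aggressive grouped-query attention (far fewer KV heads) shrinks this figure considerably, which is part of why per-token cache size tends to grow more slowly than parameter count.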