Comment by theshrike79
5 hours ago
Think of it like this: Anthropic has to keep a full virtual machine running just for you. How long should it idle there taking resources when you only pay a static monthly fee and not hourly?
They have a limited number of resources and can’t keep everyone’s VM running forever.
I pay $5/mo to Vultr for a VM that runs continuously and maintains 25GB of state.
That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.
The KV cache of your Claude context is:
- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)
- While it's being used, it's all in RAM.
- Actually it's held in HBM: special high-bandwidth memory bonded directly to the silicon of ludicrously expensive, state-of-the-art GPUs.
- The KV state memory has to be many thousands of times faster than your 25GB state.
- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.
- Because far more people (and their agents) use Claude than rent VMs, far more users are competing for that expensive memory at the same time.
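To see why "potentially much larger than 25GB" is plausible, here's a back-of-envelope KV cache size estimate. Claude's architecture isn't public, so every dimension below is an assumption borrowed from typical large open-weight models (a 70B-class model with grouped-query attention):

```python
# Hypothetical model dimensions -- Claude's real architecture is not public.
n_layers = 80        # transformer layers (assumption)
n_kv_heads = 8       # KV heads under grouped-query attention (assumption)
head_dim = 128       # dimension per attention head (assumption)
bytes_per = 2        # fp16/bf16 element size

# Both K and V are cached, per layer, per KV head, per token.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
print(bytes_per_token)           # 327680 -> roughly 320 KiB per token

ctx_tokens = 200_000             # one long Claude conversation
total_gib = bytes_per_token * ctx_tokens / 2**30
print(f"{total_gib:.1f} GiB")    # 61.0 GiB for a single context
```

Under those (guessed) numbers, a single maxed-out context already outweighs the whole 25GB VM, and it has to sit in GPU memory while the model is generating.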
There is a lot going on to move KV cache state between GPU memory and dedicated, cheaper storage, on demand as different users need different state. But the KV cache data is so large, and used in its entirety when the context is active, that moving it around is expensive too.
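A rough sketch of why the moving itself hurts: swapping one user's KV state between memory tiers is bounded by interconnect bandwidth. The bandwidth figures below are ballpark public numbers for current hardware, not measurements of Anthropic's setup, and the 60 GiB cache size is assumed:

```python
# Time to move one user's (assumed) 60 GiB KV cache across each tier.
# Bandwidths are rough public ballpark figures, not measured values.
kv_gib = 60

tiers = {
    "GPU HBM (~3000 GiB/s)": 3000,
    "PCIe 4.0 x16 (~25 GiB/s)": 25,
    "NVMe SSD (~5 GiB/s)": 5,
}
for name, bw_gib_s in tiers.items():
    secs = kv_gib / bw_gib_s
    print(f"{name}: {secs:.2f} s to move {kv_gib} GiB")
```

So even a one-hop eviction to host RAM over PCIe costs seconds of wall-clock time per user, which is why the state can't just be parked and restored for free.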
It does not run continuously, either. The provider just has a fast way to give you the illusion it "runs continuously" with 25GB of warm memory.
Tbh, I'm not sure paged VRAM could solve this problem for an (assumed) huge-cache-miss system such as a major LLM server.