Comment by tikkabhuna

12 hours ago

I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.

I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.

That is because LLM KV caching is not like the caches you are used to. (See my other comments: the cache is tens of GB per request, it consists of internal LLM state that must live on, or be moved onto, a GPU, and much of the cost is in moving all that data around.) It cannot be made transparent to the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users, as it is with all commercial LLM APIs.
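To make the "tens of GB" claim concrete, here's a back-of-envelope sketch. The model dimensions (80 layers, 8 KV heads via grouped-query attention, head dim 128, fp16) are illustrative assumptions, not Claude's actual architecture:

```python
# Back-of-envelope KV cache size for a hypothetical large transformer.
# All dimensions are illustrative assumptions, not Claude's real config.
n_layers = 80        # transformer layers
n_kv_heads = 8       # KV heads (grouped-query attention)
head_dim = 128       # dimension per head
bytes_per_value = 2  # fp16
context_tokens = 200_000

# Each token stores one key and one value vector per KV head per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
total_bytes = bytes_per_token * context_tokens

print(f"{bytes_per_token / 1024:.0f} KiB per token")       # 320 KiB
print(f"{total_bytes / 1e9:.1f} GB for the full context")  # 65.5 GB
```

With full multi-head attention instead of grouped-query, multiply that by the head-count ratio; either way a long context lands in the tens of gigabytes.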

Think of it like this: Anthropic has to keep a full virtual machine running just for you. How long should it sit there idle, consuming resources, when you only pay a static monthly fee and not hourly?

They have a limited number of resources and can’t keep everyone’s VM running forever.

  • I pay $5/mo to Vultr for a VM that runs continuously and maintains 25GB of state.

    • That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.

      The KV cache of your Claude context is:

      - Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)

      - While it's being used, it's all in RAM.

      - Actually it's held in special high-bandwidth GPU memory (HBM), bonded directly into the package of ludicrously expensive, state-of-the-art GPUs.

      - The KV cache memory has to be many thousands of times faster than the SSD holding your 25GB of state.

      - It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.

      - Because Claude is used by far more people (and their agents) than rent VMs, far more users are competing for that expensive memory at the same time.

      There is a lot of machinery for moving KV cache state between GPU memory and dedicated, cheaper storage on demand, as different users need different state. But the KV cache is so large, and is used in its entirety whenever the context is active, that moving it around is expensive too.
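      To illustrate why the moving itself is expensive: with rough, assumed bandwidth figures (not measured numbers), shuttling a ~65 GB cache across memory tiers takes seconds, while the GPU churns through its own HBM in tens of milliseconds:

      ```python
      # Rough transfer-time estimates for a ~65 GB KV cache.
      # Bandwidth figures are ballpark assumptions, not measurements.
      cache_gb = 65.5

      bandwidths_gb_per_s = {
          "GPU HBM (on-device)": 3000,          # ~3 TB/s on a modern datacenter GPU
          "PCIe Gen4 x16 (to host RAM)": 25,
          "NVMe SSD (to local storage)": 7,
      }

      for tier, bw in bandwidths_gb_per_s.items():
          print(f"{tier}: {cache_gb / bw:.3f} s")
      ```

      So every cache miss that has to be refilled from a slower tier costs seconds of wall-clock time and saturated interconnect bandwidth, multiplied across every concurrent user.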

    • It does not. It just has a fast way to give you the illusion it "runs continuously" with 25GB of warm memory.

      Tbh, I'm not sure paged VRAM could solve this problem for a system with (presumably) huge cache-miss rates, such as a major LLM server.