Comment by londons_explore
8 hours ago
The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100GB of ram per request for 12 hours.
8 hours ago
The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100GB of ram per request for 12 hours.
> 12 hours
have things changed around this recently? I know openai optionally allows 24 hours but thought it was ~1h without that, and anthropic used to quote 5-15 minutes or something.
Anthropic is 5 minutes, though you can pay more to get 60 minutes I believe.
The #1 was to make Claude code token quota go further is to never let the 5 minute cache TTL expire. Either send a new request within the window, or use /clear and copy/paste, or use /clear and a framework that automatically generates session state that gets replayed from files after /clear.
This is one reason why price of SSDs also doubled, not just of RAM.
> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.
https://cloud.google.com/blog/topics/developers-practitioner...
or maybe they don’t actually cache (fully) but lie and just don’t charge the user right now. at least half the users, who are probably also using the most similar tokens / prompts, wouldn’t really know the difference in latency (or care)
If it actually cost that much RAM, they would almost certainly add extra things to the API to manage cache lifetime. Ie. A 'please cache this for X minutes' flag, or a setting for a single re-use cache (the most common use case)
https://platform.claude.com/docs/en/build-with-claude/prompt...
suggests the can cache outside the gpu.