Comment by b65e8bee43c2ed0

10 hours ago

the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:

>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.

262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
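
a quick back-of-the-envelope check of those numbers, as a rough sketch. assumptions not in the article: "262K" means 2**18 = 262,144 tokens, "GB" is decimal (10**9 bytes), and each character is one byte of plain ASCII. the names here are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
CONTEXT_TOKENS = 262_144        # assumption: "262K" means 2**18 tokens
CHARS_PER_TOKEN = 5             # from the comment: ~5 characters per token
KV_CACHE_BYTES = 56 * 10**9     # from the quote: ~56 GB; assumption: decimal GB


def plaintext_bytes(tokens: int, chars_per_token: int) -> int:
    """Size of the text, assuming one byte per character (plain ASCII)."""
    return tokens * chars_per_token


text_bytes = plaintext_bytes(CONTEXT_TOKENS, CHARS_PER_TOKEN)
kv_bytes_per_token = KV_CACHE_BYTES / CONTEXT_TOKENS   # cache cost of one token
expansion = KV_CACHE_BYTES / text_bytes                # cache size vs. raw text

print(f"plaintext: {text_bytes / 1e6:.2f} MB")
print(f"KV cache per token: {kv_bytes_per_token / 1e3:.0f} KB")
print(f"KV cache / plaintext: {expansion:,.0f}x")
```

under those assumptions that works out to about 1.31 MB of text, roughly 214 KB of KV cache per token, and a cache that is on the order of 43,000 times larger than the text it represents.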

7 comments


londons_explore 8 hours ago

The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.

jasonjmcghee 4 hours ago
> 12 hours
have things changed around this recently? I know openai optionally allows 24 hours but thought it was ~1h without that, and anthropic used to quote 5-15 minutes or something.
- brookst 4 hours ago
  
  Anthropic is 5 minutes, though you can pay more to get 60 minutes I believe.
  The #1 way to make your Claude Code token quota go further is to never let the 5-minute cache TTL expire. Either send a new request within the window, or use /clear and copy/paste, or use /clear and a framework that automatically generates session state that gets replayed from files after /clear.
dist-epoch 7 hours ago

This is one reason why the price of SSDs also doubled, not just that of RAM.
> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.
https://cloud.google.com/blog/topics/developers-practitioner...
choppaface 8 hours ago
or maybe they don’t actually cache (fully) but lie and just don’t charge the user right now. at least half the users, who are probably also using the most similar tokens / prompts, wouldn’t really know the difference in latency (or care)
- londons_explore 7 hours ago
  
  If it actually cost that much RAM, they would almost certainly add extra things to the API to manage cache lifetime, i.e. a 'please cache this for X minutes' flag, or a setting for a single-reuse cache (the most common use case).
  