Comment by dbeardsl
19 hours ago
I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).
Perhaps if we were willing to pay more for our subscriptions, Anthropic could offer longer cache windows. But one hour seems like a reasonable amount of time given the context, and it's a limitation I'm happy to work around (it's not that hard) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro in addition to my Claude Max sub, so I'm not really biased one way or the other. I just want a quality LLM that's affordable.
I might be willing to pay more, maybe a lot more, for a higher subscription than Claude Max 20x, but the only thing higher is pay-per-token, and I really don't like products that make me that minutely aware of my usage, especially when there's unpredictability to it. I think there's a reason most telecoms moved away from per-minute or especially per-MB charging. Even per-GB has softened into "X GB included" plans, and I'm OK with that on a phone, but much less so on a computer, because a software update's size is unpredictable.
Kinda like when restaurants make me pay for ketchup or a takeaway box: I get annoyed. Just build it into the listed price.
Token anxiety is real mental overhead.
It doesn't make sense to pay more for cache warming. Your session, for the most part, is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?
Because it significantly increases actual costs for Anthropic.
If they ignored this then all users who don’t do this much would have to subsidize the people who do.
Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?
Sure, it wouldn’t make sense if they only had one customer to serve :)
Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state immediately after the N context messages in cache to disk, and reload it without extra inference on the context tokens themselves. If every customer did this for ~3 conversations each, you'd still only need a small fraction of a typical datacenter to house the necessary drives. The bottleneck becomes architecture/topology and the speed of your buses, problems that have been contended with for decades, not inference time on GPUs.
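A rough back-of-envelope in Python (the ~20 GiB-per-session figure is borrowed from a comment further down; the user count and drive size are assumptions):

    GIB = 2**30
    sessions = 1_000_000 * 3              # assumed: 1M users, ~3 idle conversations each
    total_bytes = sessions * 20 * GIB     # ~20 GiB of KV cache per session
    print(total_bytes / 2**50)            # ~57 PiB overall
    print(total_bytes / (20 * 2**40))     # ~2,900 drives at 20 TiB apiece

A few thousand drives is rack-scale, not datacenter-scale; the hard part, as noted, is the bus/network bandwidth to move tens of GiB back next to the GPUs quickly.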
> I was never under the impression that gaps in conversations would increase costs
The UI could indicate this by showing a timer before context is dumped.
A countdown clock telling you that you should talk to the model again before your streak expires? That's the kind of UX I'd expect from an F2P mobile game or an abandoned-shopping-cart nag notification.
Well sure, if you put it that way, they're similar. But either you don't see it and get surprised by increased quota usage, or you do see it and know what it means. Bonus points if they let you turn it off.
No need to gamify it. It's just UI.
Why not an automated ping message that's cheap for the model to respond to?
Yes!! A UI widget showing how far along the prompt-cache eviction timeline we are would be great.
That sounds stressful.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the KV cache wouldn't need to be loaded back into VRAM for this. If that worked reliably, cache expiry timeouts could be more aggressive (5 minutes instead of an hour).
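A minimal sketch of such a keep-alive daemon (everything here is hypothetical: the TTL, the watched directory, and send_keepalive() are assumptions, not actual Claude Code features):

    import os, time

    CACHE_TTL = 5 * 60  # assumed aggressive 5-minute cache expiry

    def send_keepalive():
        # Placeholder: in practice, a minimal request that reuses the session's
        # exact prompt prefix, refreshing the provider-side cache TTL.
        pass

    def recently_touched(path, window):
        """True if any file under `path` was modified in the last `window` seconds."""
        now = time.time()
        return any(
            now - os.path.getmtime(os.path.join(root, f)) < window
            for root, _, files in os.walk(path)
            for f in files
        )

    while True:
        if recently_touched(".", CACHE_TTL):
            send_keepalive()
        time.sleep(CACHE_TTL // 2)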
I tried to hack the statusline to show this, but when I tried, I don't think the API gave that info. I'd love it if they let us access more variables in the statusline.
By caching they mean “cached in GPU memory”. That's a very, very scarce resource.
Caching to RAM and disk is a thing, but it's hard to keep performance up that way, and it's early days for that tech being deployed anywhere.
Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)
Nit: It doesn’t have to live in GPU memory. The system will use multiple levels of caching and will evict older cached data to CPU RAM or to disk if a request hasn’t recently come in that used that prefix. The problem is, the KV caches are huge (many GB) and so moving them back onto the GPU is expensive: GPU memory bandwidth is the main resource constraint in inference. It’s also slow.
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic, and the usual intuitions about caching you might have from e.g. web development don't apply. A single cache blob for a single request is in the tens of GB at least for a big model, and many of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache are internal model state; it's not your context or prompt or anything like that.
Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it, because it makes inference faster and cheaper.
If you want to read up on this subject, be careful: a lot of blogs will explain the KV cache as it is used within inference for a single request (a critical concept in how LLMs work) but gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I'd recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
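To make "tens of GB" concrete, here is the sizing arithmetic for a hypothetical dense model (all dimensions are illustrative, not Claude's actual architecture):

    # KV cache bytes per token = 2 (K and V) x layers x kv_heads x head_dim x dtype bytes
    layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2   # assumed dims, bf16
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    print(per_token)                      # 327,680 bytes, ~320 KB per token
    print(per_token * 128_000 / 2**30)    # ~39 GiB for a 128k-token context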
> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?
You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
What is being discussed is KV caching [0], which every transformer-based LLM uses to reduce inference compute from O(n^2) to O(n). This is not specific to Claude nor Anthropic.
[0]: https://huggingface.co/blog/not-lain/kv-caching
> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
I use CC, and I understand what caching means.
I have no idea how that works with an LLM implementation, nor do I actually know what they are caching in this context.
> 99.99% of users won't even understand the words that are being used.
That's a bad estimate. Claude Code is explicitly a developer-shaped tool; we're not talking generic ChatGPT here. My guess is probably closer to 75% of those users understanding what caching is, with maybe 30% able to explain what prompt caching actually is. And of course, the users who don't understand have access to Claude and can have it explain caching to them if they're interested.
I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
I would say this is abstracting the behavior.
Yes. It's perfectly reasonable to expect the user to know the intricacies of the caching strategy of their LLM. Totally reasonable expectation.
To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session, I noticed that my next prompt would chew up a bunch of context. That led me to do some digging, at which point I discovered their prompt caching.
So while I'd agree with your sarcasm that expecting users to be experts in the system is a big ask, where I disagree is that I think users should be curious and actively try to understand how it works around them. Granted, since the tooling changes often, that's an endless job.
It's not like they have a poweful all-knowing oracle that can explain it to them at their dispos... oh, wait!
Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.
It is more useful to read posts and threads like this one, IMO. We can't know everything, and the market Claude Code currently addresses goes far beyond people who would even think about caching in the first place.
It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't cost the same as an LLM forward pass.
It seems you haven't done the due diligence on what the parent meant :)
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing LLM inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
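A toy single-head illustration in NumPy (real engines do this per layer and per head, batched; all dimensions here are made up):

    import numpy as np

    d = 64  # illustrative head dimension
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Encode a 1000-token prefix ONCE; this is the state that gets cached.
    prefix = rng.standard_normal((1000, d))
    k_cache, v_cache = prefix @ Wk, prefix @ Wv

    # Each new token then does a constant amount of projection work and attends
    # over the cache, instead of re-encoding the whole prefix from scratch.
    x_new = rng.standard_normal(d)
    k_cache = np.vstack([k_cache, x_new @ Wk])
    v_cache = np.vstack([v_cache, x_new @ Wv])
    out = softmax((x_new @ Wq) @ k_cache.T / np.sqrt(d)) @ v_cache

Lose the cache and the `prefix @ Wk` / `prefix @ Wv` work has to be redone for every layer of the model; that re-prefill is what you get billed for.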
I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.
What if the cache were backed up to cold storage, instead of having to recompute everything?
They probably already do that. But these caches can get pretty big (10s of GBs per session), so that adds up fast, even for cold storage.
How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?
I'm not sure that it's O(N) with caching, but this illustrates the N^2 part:
https://blog.exe.dev/expensively-quadratic
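The gist, counting prefill work only (turn sizes are made up):

    # Tokens (re)processed over an N-turn chat, each turn adding t new tokens.
    t, N = 1_000, 50
    no_cache = sum(k * t for k in range(1, N + 1))  # full context re-encoded every turn
    cached = N * t                                  # each token prefilled once, prefix reused
    print(no_cache, cached)                         # 1,275,000 vs 50,000

So a 3-turn conversation without caching processes 1t + 2t + 3t = 6t tokens rather than 3t; not 9x, but growing quadratically with the number of turns.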
How big is this cached data? Wouldn't it be possible to download it after idling a few minutes ("to suspend the session"), then upload and restore it when the user starts their next interaction?
Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache
With this much cheaper setup backed by disks, they can offer a much better caching experience:
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
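Conceptually it's just serializing the per-layer K/V tensors, as in this toy-scale sketch (shapes are illustrative and tiny; 3FS does the same thing at datacenter scale):

    import torch

    # One (K, V) pair per layer; shape [batch, kv_heads, seq_len, head_dim].
    kv_cache = [
        (torch.randn(1, 8, 1024, 128, dtype=torch.bfloat16),
         torch.randn(1, 8, 1024, 128, dtype=torch.bfloat16))
        for _ in range(8)
    ]
    torch.save(kv_cache, "session.kv")   # "suspend": cache goes to disk, GPU slot freed
    restored = torch.load("session.kv")  # "resume": reload instead of re-prefilling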
What they mean when they say 'cached' is that it is loaded into GPU memory on Anthropic's servers.
You already have the data on your own machine, and that 'upload and restore' process is exactly what happens when you restart an idle session. The issue is that it takes time, and it counts as token usage, because you have to send the data for the GPU to load, and that data is the 'tokens'.
> upload and restore it when the user starts their next interaction
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
I often see a local model (QWEN3.5-Coder-Next) grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion-parameter models are even worse. Even if you could download it or offload it, or someone offered that as a service, to start back up again you'd _still_ be paying the token cost, because all of that context _is_ the tokens you've just run.
The cache is what makes your journey from a 1k prompt to a 1-million-token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.
A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
I too would far rather bear a token cost than have my sessions rot silently beneath my feet. I usually have ~5 running CC sessions, some of which I may leave for a week or two of inactivity at a time.
Same here, and I suspect there are dozens of us.
Instead of just dropping all the context, the system could also run a compaction (summarizing the entire convo) before dropping it. Better to continue with a summary than to lose everything.
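A hypothetical sketch of compact-before-evict (llm() is a stand-in for any completion call, not a real Claude Code API):

    def llm(prompt: str) -> str:
        # Placeholder for a real completion call via whatever SDK you use.
        raise NotImplementedError

    def compact(messages: list[dict]) -> list[dict]:
        """Replace a long conversation with one summary turn before the cache drops."""
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        summary = llm(
            "Summarize this session: key decisions, files touched, open TODOs, "
            "and constraints to respect.\n\n" + transcript
        )
        return [{"role": "user", "content": "Summary of our session so far:\n" + summary}]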
There are problems with this approach as well, I've found.
I'm really beginning to feel the lack of control when it comes to context, if I'm being honest.
Yes! This is what we’re trying next.
It'd probably be helpful for power users, and for transparency, to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch the cache slots fill up with every turn; when subagents spawn, you see another process ID spin up and take over a cache slot; and the model starts slowing down as the context grows (around 80-90k on an AMD 395+) and each cache load gets bigger.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral, because keeping all that cache active is computationally expensive.
You're still just running text through an extremely complex process and adding to that text, and to avoid recalculating the entire chain, you need the cache.
How else would you implement it?
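For what it's worth, llama-server already surfaces some of this; something like the following polls its slots endpoint (documented in the llama.cpp server README; it may need to be enabled, and field names can vary by version):

    import json, urllib.request

    # Inspect llama-server's per-slot cache state (assumes the default :8080 port).
    with urllib.request.urlopen("http://localhost:8080/slots") as r:
        for slot in json.load(r):
            print(slot.get("id"), slot.get("is_processing"), slot.get("n_ctx"))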