← Back to context

Comment by computably

20 hours ago

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.

How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

  • > How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

    1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.

    2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.

    > You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

    I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.

  • I use CC, and I understand what caching means.

    I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.

    • They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.

  • > 99.99% of users won't even understand the words that are being used.

    That's a bad estimate. Claude Code is explicitly a developer shaped tool, we're not talking generically ChatGPT here, so my guess is probably closer to 75% of those users do understand what caching is, with maybe 30% being able to explain prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.

I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

  • > Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

    Does mmap(2) educate the developer on how disk I/O works?

    At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.

Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.

  • To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.

    So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.

    • > users should be curious and actively attempting to understand how it works

      Have you ever talked with users?

      > this is an endless job

      Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.

      8 replies →

It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.

  • It seems you haven't done the due diligence on what the parent meant :)

    It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

    It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.

    • You not only skipped the diligence but confused everyone repeating what I said :(

      that is what caching is doing. the llm inference state is being reused. (attention vectors is internal artefact in this level of abstraction, effectively at this level of abstraction its a the prompt).

      The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.

  • I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.

What if the cache was backed up to cold storage? Instead of having to recompute everything.

  • They probably already do that. But these caches can get pretty big (10s of GBs per session), so that adds up fast, even for cold storage.

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

  • I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

    https://blog.exe.dev/expensively-quadratic

    • If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?

      If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?

      4 replies →

How big this cached data is? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?

  • Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

    With this much cheaper setup backed by disks, they can offer much better caching experience:

    > Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.

  • What they mean when they say 'cached' is that it is loaded into the GPU memory on anthropic servers.

    You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.

    • Wrong on both counts. The kv-cache is likely to be offloaded to RAM or disk. What you have locally is just the log of messages. The kv-cache is the internal LLM state after having processed these messages, and it is a lot bigger.

  • > upload and restore it when the user starts their next interaction

    The data is the conversation (along with the thinking tokens).

    There is no download - you already have it.

    The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.

    That is doable, but as Boris notes it costs lots of tokens.

    • You're quite confidently wrong! :-)

      The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.

  • I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd better these trillion parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.

    The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.

  • A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.

    Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."

    • He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.