Comment by 6keZbCECT2uB
20 hours ago
"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"
This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.
The default thinking level seems more forgivable, but the churn in system prompts is something I'll need to figure out how to intentionally choose a refresh cycle.
Hey, Boris from the Claude Code team here.
Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
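To make the arithmetic above concrete, here is a minimal sketch under a deliberately simplified model: the history before the latest message is either served entirely from cache or re-written entirely. The names `TurnCost` and `turn_cost` and the one-hour TTL default are illustrative assumptions, not Claude Code internals.

```python
from dataclasses import dataclass

ONE_HOUR = 3600  # assumed cache lifetime in seconds, taken from the comment above


@dataclass
class TurnCost:
    cache_read_tokens: int   # tokens served from the prompt cache
    cache_write_tokens: int  # tokens that must be processed and written to cache


def turn_cost(history_tokens: int, new_tokens: int, idle_seconds: float,
              ttl_seconds: float = ONE_HOUR) -> TurnCost:
    """Simplified model: the history hits the cache unless the session idled past the TTL."""
    if idle_seconds > ttl_seconds:
        # Full cache miss: the history and the new message are all written again.
        return TurnCost(0, history_tokens + new_tokens)
    # Warm cache: only the latest message is new work.
    return TurnCost(history_tokens, new_tokens)


warm = turn_cost(900_000, 500, idle_seconds=60)
cold = turn_cost(900_000, 500, idle_seconds=2 * ONE_HOUR)
print(warm)
print(cold)
```

In this model the same 500-token prompt costs 500 written tokens on a warm cache and 900,500 after the idle period, which is the gap described above.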
We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
Hope this is helpful. Happy to answer any questions you have.
I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).
Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
> I was never under the impression that gaps in conversations would increase costs
The UI could indicate this by showing a timer before context is dumped.
By caching they mean “cached in GPU memory”. That’s a very very scarce resource.
Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.
Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)
> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
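The O(N^2) versus O(N) claim can be checked with a small counting sketch. It assumes equal-sized turns and counts only input tokens that have to be freshly processed, ignoring output tokens and the cost of cache reads; `total_processed_tokens` is a hypothetical helper for illustration.

```python
def total_processed_tokens(turn_sizes: list[int], cached: bool) -> int:
    """Sum the input tokens that must be freshly processed over a whole conversation."""
    total = 0
    context = 0  # tokens already in the conversation before the current turn
    for size in turn_sizes:
        if cached:
            total += size            # only the new turn is processed
        else:
            total += context + size  # the entire context is re-processed
        context += size
    return total


turns = [1_000] * 100  # 100 turns of 1,000 tokens each
print(total_processed_tokens(turns, cached=False))  # 5050000
print(total_processed_tokens(turns, cached=True))   # 100000
```

With 100 turns the uncached total is about 50 times the cached one, and that ratio grows linearly with the number of turns.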
I too would far rather bear a token cost than have my sessions rot silently beneath my feet. I usually have ~5 running CC sessions, some of which I may leave for a week or two of inactivity at a time.
Instead of just dropping all the context, the system could also run a compaction (summarizing the entire convo) before dropping it. Better to continue with a summary than to lose everything.
Yes! This is what we’re trying next.
It'd probably be helpful for power users and transparency to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch how the cache slots fill up with every turn; when subagents spawn, you see another process id spin up and it takes up a cache slot; when the model starts slowing down is when the context grows (amd 395+ around 80-90k) and the cache loads are bigger because you've got all that.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...
You're still just running text through an extremely complex process and adding to that text, and to avoid recalculating the entire chain, you need the cache.
How else would you implement it?
Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
I think it’s crazy that they do this, especially without any notice. I would not have renewed my subscription if I knew that they started doing this.
Especially in the analysis part of my work I don't care about the actual text output itself most of the time but try to make the model „understand“ the topic.
In the first phase the actual text output itself is worthless; it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it. And they're… just throwing most of the relevant stuff out without any notice when I resume my session after a few days?
This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.
There would be nothing lost if they said „If you click yes, we will prune your old thinking, making Claude faster and saving you tons of tokens“. Most people would probably say yes, so why not ask them… make it an env variable (one that is announced, not a secretly introduced one to opt out of something new!) or at least write it in a change log if they really don't want to allow people to use it like before, so there'd be a chance to cancel the subscription in time instead of wasting tons of time on work patterns that no longer work.
Why can't you just build a project document that outlines the prompt that you want to do? Or have Claude save your progress in memory so you can pick it up later? That's what I do. It seems abhorrent to expect to have a running prompt left idle for long periods of time just so you can pick it up at a moment's whim...
Don't you have that by just resuming old convo?
The only issue is that it didn't hit the cache so it was expensive if you resume later.
These controversies erupt regularly, and I hope that you will see a common thing with most of them: you make a decision for your users without informing them.
Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.
I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.
A company that needs to anchor every single thing with the users will create a stale product.
While I hate all the gaslighting Anthropic seems to do recently (and the fact that their harness broke the code quality, while they forbid use of third party harnesses), making decisions for users is what UX is.
See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".
I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.
Ahh that makes sense. Sometimes it's convenient to re-use an older conversation that has all the context I need. But maybe it's just the last 20% that's relevant.
It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.
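A "keep the last N tokens" cut could look roughly like the sketch below. It assumes messages are plain strings and uses a whitespace word count as a stand-in for a real tokenizer; `keep_recent` and its parameters are hypothetical, and nothing here reflects how /compact works internally.

```python
from typing import Callable


def keep_recent(messages: list[str], budget: int,
                count_tokens: Callable[[str], int] = lambda m: len(m.split())) -> list[str]:
    """Keep the newest messages whose combined token count fits within the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                        # stop at the first message that does not fit
        kept.append(message)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i", "j"]
print(keep_recent(history, budget=5))
```

Cutting at whole-message boundaries keeps each retained message intact, at the price of sometimes using less than the full budget.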
I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.
This violates the principle of least surprise, with nothing to indicate Claude got lobotomized while it napped when so many use prior sessions as "primed context" (even if people don't know that's what they were doing or know why it works).
The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.
// If this notion of sufficient context as fine-tune seems surprising, the research is out there.
Approaches tried need to deal with both of these:
1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.
2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.
I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.
I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate it a bit more (e.g. it's not nice to send lawyers after various devs without asking nicely first, ban accounts without notice, etc.). I appreciate it's not easy to scale.
OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
As some others have mentioned.
I think the best option would be to tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and that the user will have to face the full cost of replaying the session, not only the incremental question and answer.
(I understand that under the hood LLMs are n^2 by default, but it's very counterintuitive, and given how popular CC is becoming outside of nerd circles, a smaller and smaller fraction of users is probably aware of it.)
I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.
> I think the best option would be to tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and that the user will have to face the full cost of replaying the session
This feature has been live for a few days/weeks now, and with that knowledge I try to remember to at least get a progress report written when I'm, for example, close to the quota limit and the context is reasonably large. Or I continue with a /compact, but that tends to lead to me having to repeat some things that didn't get included in the summary. Context management is just hard.
Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.
It's a little concerning that it's number 1 in your list.
Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.
Two questions if you see this:
1) if this isn't best practice, what is the best way to preserve highly specific contexts?
2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
Have the tool maintain a doc, and use either the built-in memory or (I prefer it this way) your own. I've been pretty critical of some other aspects of how Claude Code works but on this one I think they're doing roughly the right thing given how the underlying completion machinery works.
Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
The cache is stored on Anthropic's servers, since it's a save state of the LLM's weights at the time of processing. It's several gigs in size. Every SINGLE TIME you send a message and it's a cache miss, you have to reprocess the entire message again, eating up tons of tokens in the process.
Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.
Thank you.
I agree with this.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
> Engaging so directly with a highly critical audience is a minefield that you're navigating well.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
Very easy to do when you stand to make tens of millions when your employer IPOs. Let's not maybe give too much praise and employ some critical thinking here.
I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.
So if they fuck it up again and now they have, let’s say, “db problems” instead of “caching problems”, you would happily simply pay more? Wtf
I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was also under the impression that re-using older existing conversations that I had open would just continue the conversation as is and not be treated as a full cache miss.
My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.
This cache information should probably get displayed somewhere within Claude Code
Yep, agree. We added a little "/clear to save XXX tokens" notice in the bottom right, and will keep iterating on this. Thanks for being an early user!
Then you need to update your documentation and teach claude to read the new documentation because here is what claude code answered:
Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message and the next one?
Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.
-- This answer directly contradicts your post. It seems like the biggest problem is a total lack of documentation for expected behavior.
A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.
Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and presents it in a totally different way. It is not just an "I will ask before applying changes" mode.
Don't be silly, they don't expect you to ask the AI questions and get the right answers. Obviously if you want to know what's going on you should look at their first solution - check what advice they have posted on X...
This isn't how LLMs work. They aren't self aware like this, they're trained on the general internet. They might have some pointers to documentation for certain cases, but they generally aren't going to have specialized knowledge of themselves embedded within. Claude code has no need to know about its own internal programming, the core loop is just javascript code.
Resuming sessions after more than 1 hour is a very common workflow that many teams are following. It would be great if this were treated as expected behaviour and the UX were designed around it. Perhaps you are not realising that Claude Code has replaced the shells people were using (i.e. bash is now replaced with a Claude Code session).
I think that's a bad idea. It seems like expecting to have a prompt open like this, accumulating context, puts a load on the back end. It's one of those things that is a bad habit, like trying to maintain open tabs in a browser as a way to keep your workflow up to date when what you really should be doing is taking notes on your process and working from there.
I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.
Create a better workflow for yourself and your teams and do it the right way. Quit expecting the prompt to store everything for you.
For the Claude team: if you haven't already, I'd recommend you create some best practices for people that don't know any better, otherwise people are going to expect things to be a certain way and it's going to cause a lot of friction when people can't do what they expect to be able to do.
> Resuming sessions after more than 1 hour is a very common workflow that many teams are following
Yeah it's called lunch!
This points to a fairly fundamental mismatch between the realities of running an LLM and the expectations of users. As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later. The fact that there is a difference, means it's now being compensated for in fairly awkward ways -- none of the solutions seem good, just varying degrees of bad.
Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?
> As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later.
As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.
This just does not match my workflow when I work on low-priority projects, especially personal projects when I do them for fun instead of being paid to do them. With life getting busy, I may only have half an hour each night with Claude to make some progress on it before having to pause and come back the next day. It’s just the nature of doing personal projects as a middle-aged person.
The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.
Why does the system work like that? Is the cache local, or on Claude's servers?
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
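The spill-to-disk idea in this comment can be sketched as a toy two-tier cache. This is only an illustration of the proposal, not how any provider stores its KV cache: the `TieredCache` class, the use of `pickle` files, and the TTL values are all assumptions, and keys are assumed to be safe file names.

```python
import pickle
import tempfile
from pathlib import Path


class TieredCache:
    """Toy two-tier cache: a hot in-memory dict and a cold on-disk directory."""

    def __init__(self, cold_dir: Path, hot_ttl: float, cold_ttl: float):
        self.cold_dir = cold_dir
        self.hot_ttl = hot_ttl      # seconds of idleness before spilling to disk
        self.cold_ttl = cold_ttl    # seconds of idleness before purging from disk
        self.hot: dict[str, tuple[float, object]] = {}  # key -> (last_used, value)
        self.cold_last_used: dict[str, float] = {}      # key -> last_used

    def put(self, key: str, value: object, now: float) -> None:
        self.hot[key] = (now, value)

    def sweep(self, now: float) -> None:
        """Spill idle hot entries to disk, then purge cold entries past their TTL."""
        for key, (last_used, value) in list(self.hot.items()):
            if now - last_used > self.hot_ttl:
                (self.cold_dir / key).write_bytes(pickle.dumps(value))
                self.cold_last_used[key] = last_used
                del self.hot[key]
        for key, last_used in list(self.cold_last_used.items()):
            if now - last_used > self.cold_ttl:
                (self.cold_dir / key).unlink()
                del self.cold_last_used[key]

    def get(self, key: str, now: float):
        """Return (value, tier), where tier is 'hot', 'cold', or 'miss'."""
        if key in self.hot:
            _, value = self.hot[key]
            self.hot[key] = (now, value)
            return value, "hot"
        if key in self.cold_last_used:
            value = pickle.loads((self.cold_dir / key).read_bytes())
            (self.cold_dir / key).unlink()
            del self.cold_last_used[key]
            self.hot[key] = (now, value)  # re-hydrate into the hot tier
            return value, "cold"
        return None, "miss"


with tempfile.TemporaryDirectory() as tmp:
    cache = TieredCache(Path(tmp), hot_ttl=3600, cold_ttl=7 * 86400)
    cache.put("session-1", "cached-state", now=0)
    cache.sweep(now=2 * 3600)                     # idle for 2 hours: spilled to disk
    print(cache.get("session-1", now=2 * 3600))   # ('cached-state', 'cold')
    cache.sweep(now=30 * 86400)                   # idle for weeks: purged
    print(cache.get("session-1", now=30 * 86400)) # (None, 'miss')
```

The sketch leaves out what the replies raise as the hard parts: the size of the cached state and the fact that requests may be routed to different machines.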
The cache is on Anthropic's servers; it's like a freeze frame of the LLM's inner workings at the time. The LLM can pick up directly from this save state. As you can guess, this save state has bits of the underlying model, their secret sauce, so it cannot be saved locally...
We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could be modified. https://arxiv.org/abs/2412.16434 .
The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
> Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?
To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.
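A per-message warning of this kind needs very little logic. The sketch below assumes the client knows the context size and the idle time and that the cache lifetime is one hour; `resume_notice`, its thresholds, and the message wording are hypothetical, while /compact and /clear are the existing commands mentioned in the thread.

```python
from typing import Optional


def resume_notice(context_tokens: int, idle_seconds: float,
                  ttl_seconds: float = 3600, warn_above: int = 100_000) -> Optional[str]:
    """Return a warning when the next message is expected to miss the prompt cache."""
    if idle_seconds <= ttl_seconds:
        return None  # cache assumed to still be warm
    if context_tokens < warn_above:
        return None  # a miss on a small context is not worth interrupting for
    return (f"Prompt cache has likely expired: sending a message will re-process "
            f"about {context_tokens:,} input tokens. Continue, /compact, or /clear?")


print(resume_notice(900_000, idle_seconds=2 * 3600))
print(resume_notice(900_000, idle_seconds=60))
```

Returning `None` for the warm and small-context cases keeps the prompt out of the way unless the expected cost is large.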
This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?
I agree.. Maybe parts of the cache contents are business secrets.. But then store a server side encrypted version on the users disk so that it can be resumed without wasting 900k tokens?
Disk where? LLM requests are routed dynamically. You might not even land in the same data center.
How does the Claude team recommend devs use Claude Code?
1) Is it okay to leave Claude Code CLI open for days?
2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?
Hi Boris
I'm curious why 1 hour was chosen?
Is increasing it a significant expense?
Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal
It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable
> We tried a few different approaches to improve this UX
how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?
> Educating users on X/social
that is beyond me
You're not Boris, you're a Borka at best.
Boris, wait, wait, wait,
Why not use a tiered cache?
Obviously storage is waaay cheaper than recalculation of embeddings all the way from the very beginning of the session.
No matter how you put this explanation, it still sounds strange. Hell, you can even store the cache on the client if you must.
Please, tell me I’m not understanding what is going on..
otherwise you really need to hire someone to look at this!)
Same question I had in https://news.ycombinator.com/item?id=47819914
I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.
I don't think you can store the cache on client given the thinking is server side and you only get summaries in your client (even those are disabled by default).
I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.
What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
reasonably, if i'm in an interactive session, it's going to have breaks for an hour or more.
what's driving the hour cache? shouldn't people be able to have lunch, then come back and continue?
are you expecting claude code users to not attend meetings?
I think product-wise you might need a better story on who uses claude-code, when and why.
Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out it's all been silently deleted
As with everything Anthropic recently this is a supply constraint issue. They have not planned for scale adequately.
So is it for latency or is it for cost?
Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?
Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.
That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.
It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.
You created this issue by setting a timer for cache clearing. Time is really not a dimension that plays any role in how coding agent context is used.
It is too surprising. Time passed should not matter for using AI.
Either swallow the cost or be transparent to the user and offer both options each time.
Wow so that's why you did #2? The explanation in the CLI is really not clear. I thought it was just a suggestion to compact, no idea it was way more expensive than if I hadn't left it idle for an hour.
You guys really need to communicate that better in the CLI for people not on social
Why not automatically run a compaction close to the 1-hour mark? Then the cache miss won’t have such a bad impact.
Hi Boris! Wanted to let you know that I find those ads with you saying "now when you code, you use an agent" obnoxious because of that incorrect statement. I have no interest in slop coding. I find it way more ergonomic and effective to use code to tell a machine precisely what to do than to use English to tell it vaguely. I hate that your ad is misleading so many non-coders, who will actually believe your lie that nobody codes anymore. Probably doesn't help that YouTube was playing it as an interruption in every video I watched. I probably saw it 100 times and was getting to the "throw the remote at the tv" stage XD.
So you made this change completely invisible to the user, without the user being able to choose between the two behaviors, and without even documenting it in the (extremely verbose) changelog [1]? I can't find it, the Docs Assistant can't find it (well, when fed your reply, it said "I found it!" three times, each time with a non-matching item).
I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.
In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthropic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.
It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.
[1] https://code.claude.com/docs/en/changelog
> The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users.
I don't agree with this being characterized as a "corner case".
Isn't this how most long running work will happen across all serious users?
I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.
Don't CC users take lunch breaks?
How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?
> We tried a few different approaches to improve this UX: 1. Educating users on X/social
No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
There's a cultural divide between SV and the 85% of SMB using M365, for example. When everyone you know uses a thing, I mean, who doesn't?*
There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!
* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.
How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.
> that would be >900k tokens written to cache all at once
Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.
Not sure if it's already done, but shouldn't there be a check somewhere that alerts when an outrageous number of tokens is being written, since that means something is not right?
maybe you could surface an expected cache miss to the user
You need to seriously look at your corporate communications and hire some adults to standardise your messaging, comms and signals. The volatility behind your doors is obvious to us and you'd impress us much more if you slowed down, took a moment to think about your customers and sent a consistent message.
You lost huge trust with the A/B sham test. You lost trust with enshittification of the tokenizer on 4.6 to 4.7. Why not just say "hey, due to huge input prices in energy, GPU demand and compute constraints we've had to increase Pro from $20 to $30." You might lose 5% of customers. But the shady A/B thing and dodgy tokenizer increasing burn rate tells everyone inc. enterprise that you don't care about honesty and integrity in your product.
I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.
2. Could you bring back the _compact and accept plan_, even if it is not the default option?
For idle sessions I would MUCH rather pay the cost in tokens than reduced quality. Frankly, it's shocking to me that you would make that trade-off for users without their knowledge or consent.
What about selling long-term cache space to users?
Or even let the user control the cache expiry on a per-request basis, with a /cache command.
That way they decide whether they want to drop the cache right away or extend it for 20 hours, etc.
It would cost tokens even if the underlying resource is memory/SSD space, not compute.
From a utility perspective, a tiered cache with some much higher-latency storage option for up to n hours would be very useful for me to prevent that L1 cache miss.
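The tiered idea in the comments above can be sketched as a short-lived hot tier backed by a slower, longer-lived cold tier. This is only an illustration under stated assumptions: `TieredCache` is a hypothetical class, both tiers are plain dictionaries standing in for fast memory and object storage, and nothing here reflects how Anthropic's prompt cache is actually built.

```python
import time


class TieredCache:
    """Hot tier with a short TTL, backed by a slower cold tier with a longer TTL."""

    def __init__(self, hot_ttl, cold_ttl, clock=time.monotonic):
        self.hot_ttl = hot_ttl
        self.cold_ttl = cold_ttl
        self.clock = clock
        self.hot = {}   # key -> (value, stored_at); stands in for fast memory
        self.cold = {}  # key -> (value, stored_at); stands in for object storage

    def put(self, key, value):
        now = self.clock()
        self.hot[key] = (value, now)
        self.cold[key] = (value, now)

    def get(self, key):
        """Return (value, tier), where tier is 'hot', 'cold', or 'miss'."""
        now = self.clock()
        entry = self.hot.get(key)
        if entry and now - entry[1] <= self.hot_ttl:
            return entry[0], "hot"
        self.hot.pop(key, None)
        entry = self.cold.get(key)
        if entry and now - entry[1] <= self.cold_ttl:
            # Promote back to the hot tier; a real system would pay extra latency here.
            self.hot[key] = (entry[0], now)
            return entry[0], "cold"
        self.cold.pop(key, None)
        return None, "miss"
```

A "cold" result is the case where a "Resuming conversation..." spinner would be shown: slower than a hot hit, but without rebuilding the entry from scratch.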
Why is time the variable you're solving for? Why can't I keep that cache warm by keeping the session open?
Hi, thanks for Claude Code. I was wondering though if you'd consider adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?
So this explains why resuming a session after a 5-hour timeout basically eats most of the next session. How then to avoid this?
I actually have a suggestion here - do not hide token count in non-verbose mode in Claude Code.
I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?
Wasn’t cache time reduced to 5 minutes? Or is that just some users' interpretation of the bug?
What about:
/loop 5m say "ok".
Will that keep the cache fresh?
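Whether a periodic ping like this actually refreshes the cache is not confirmed anywhere in the thread. Assuming it does, a rough way to reason about whether it pays off is to compare the cost of re-reading the cached context on every ping against the cost of one full re-write after a miss. The functions and the multipliers `read_mult` and `write_mult` (price of a cached read and of a cache write, relative to a normal input token) are illustrative assumptions, and the sketch ignores the few output tokens each ping produces.

```python
def keepalive_cost(context_tokens, idle_minutes, ping_minutes, read_mult):
    """Cost, in base-input-token units, of re-reading the cached context on every ping."""
    pings = idle_minutes // ping_minutes
    return pings * context_tokens * read_mult


def miss_cost(context_tokens, write_mult):
    """Cost, in base-input-token units, of re-writing the whole context after a miss."""
    return context_tokens * write_mult


def break_even_minutes(ping_minutes, read_mult, write_mult):
    """Idle time beyond which pinging costs more than one full re-write."""
    return ping_minutes * write_mult / read_mult
```

With illustrative multipliers of 0.1 for reads and 1.25 for writes, a 5-minute ping breaks even at about 62.5 minutes of idle time, so under those assumptions pinging through a multi-hour break costs more than simply taking the miss.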
Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.
The entire reason I keep a long-lived session around is because the context is hard-won, in terms of tokens and my time.
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted tens of hours of my time and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
> The entire reason I keep a long-lived session around is because the context is hard-won — in term of tokens and my time. Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
Hard agree, would like to see a response to this.
As a variation:
How does this help me as a customer? If I have to redo the context from scratch, I will pay the high token cost again and also spend my own time to fill it.
The cost of reloading the window didn't go away, it just went up even more.
> I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.
> tokens written to cache all at once, which would eat up a significant % of your rate limits
Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.
Since the devs on HN (& the whole world) are buying what looks like nonsense to me - what am I missing?
It astounds me that a company valued in the hundreds-of-billions-of-dollars has written this. One of the following must be true:
1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.
2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.
It’s certainly #2. They have shown over dozens of decisions that they move very quickly, break stuff, then have to figure out both what broke and how to explain it.
It’s definitely a cost / resource saving strategy on their end.
It's very weird that they frame caching as "latency reduction" when it comes to a cloud service. I mean, yes, technically it reduces latency, but more importantly it reduces cost. Sometimes it's more than 80% of the total cost.
I'm sure most companies and customers would consider compromising quality for an 80% cost reduction. If they were just honest about it, they'd be fine.
The same company that claims they have models that are too "dangerous" to release btw.
What's even more amazing is that it took them two weeks to fix what must have been a pretty obvious bug, especially given who they are and what they are selling.
They just vibecoded a fix and didn't think about the tradeoff they were making, and their perpetual yes-man of a model just went with it.
Yeah this is actually quite shocking. In my earlier uses of CC I might noodle on a problem for a while, come back and update the plan, go shower, think, give CC a new piece of advice, etc. Basically treating it like a coworker. And I thought that it was a static conversation (at least on the order of a day or so). An hour is absurd IMO and makes me want to rethink whether I want to keep my anthropic plan.
It's also a bit of a fishy explanation for purging tokens older than an hour. This happens to also be their cache limit. I doubt it is incidental that this change would also dramatically drop their cost.
They moved it to 5m around the same timeframe though: https://www.reddit.com/r/ClaudeAI/comments/1sk3m12/followup_...
Seems like it would interact very badly with the time based usage reset. If lots of people are hitting their limit and then letting the session idle until they can come back, this wouldn't be an exception. It would almost be the default behaviour.
Wow, I always thought the context was stored locally and was something I had control over.
Glad I use kiro-cli which doesn't do this.
you might be biased due to your employment :)