Comment by bcherny
16 hours ago
Hey all, Boris from the Claude Code team here.
We've been investigating these reports, and a few of the top issues we've found are:
1. Prompt cache misses when using 1M token context window are expensive. Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred. To experiment with this now, try: CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 claude.
2. People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins. This was the case for a surprisingly large number of users, and we are actively working on (a) improving the UX to make these cases more visible to users and (b) more intelligently truncating, pruning, and scheduling non-main tasks to avoid surprise token usage.
In the process, we ruled out a large number of hypotheses: adaptive thinking, other kinds of harness regressions, model and inference regressions.
We are continuing to investigate and prioritize this. The most actionable thing for people running into this is to run /feedback, and optionally post the feedback ids either here or in the Github issue. That makes it possible for us to debug specific reports.
Boris, you're seeing a ton of anecdotes here and Claude has done something that has affected a bunch of their most fervent users.
Jeff Bezos famously said that if the anecdotes are contradicting the metrics, then the metrics are measuring the wrong things. I suggest you take the anecdotes here seriously and figure out where/why the metrics are wrong.
On the subject of metrics, better user-facing metrics to understand and debug usage patterns would be a great addition. I'd love an easier way to understand the average cost incurred by a specific skill, for example. (If I'm missing something obvious, let me know.)
Baking deeper analytics into CC would be helpful... similar to ccusage perhaps: https://github.com/ryoppippi/ccusage
This is useful if you want to keep an eye on what claude's actually doing behind the scenes: https://github.com/simple10/agents-observe
We are taking it seriously, and are continuing to investigate. We are not trusting the metrics.
The quantitative ux research team at Google was created for exactly this problem: a service which became popular before the right metrics existed, meaning metrics need to be derived first, then optimized. We would observe users (irl), read their logs, then generate experiments to improve the behavior as measured by logs, and return to see if the experiment improves irl experiences. There were not many of us and we are around :)
Thank you
Hopefully yourself, and not via your ai tools.
Cool, are you going to be transparent and explain the metrics and costs as a postmortem? And given the inability to actually audit what you produce, why should we trust Anthropic?
But the default 1M context window just rolled out a few weeks ago. If refreshing old sessions on 1M context windows is the problem, it's completely aligned with what Boris is saying.
Why did this become an issue seemingly overnight when 1M context has been available for a while, and I assume prompt caching behavior hasn't changed?
EDIT: prompt caching behavior -did- change! 1hr -> 5min on March 6th. I'm not sure how starting a fresh session fixes it, as it's just rebuilding everything. Why even make this available?
It feels like the rules changed and the attitude from Anth is "aw I'm sorry you didn't know that you're supposed to do that." The whole point of CC is to let it run unattended; why would you build around the behavior of watching it like a hawk to prevent the cache from expiring?
> 1hr -> 5min on March 6th
This is not accurate. The main agent typically uses a 1h cache (except for API customers, which can enable 1h but it is not on by default because it costs more). Sub-agents typically use a 5m cache.
https://github.com/anthropics/claude-code/issues/46829#issue... - Have you checked with your colleague? (and his AI, of course)
Then my original question stands: why did this become an issue seemingly overnight if nothing changed?
So if I run a test suite or compile my rust program in a sub agent I’m going to get cache misses? Boo.
... so how do API users enable 1hr caching? I haven't found a setting anywhere.
For me, definitely the worst regression was the system prompt telling Claude to analyze every file it reads to check whether it's malware. That correlates with me also seeing quotas exhausted early and "not malware" acknowledgments at almost every step.
It is a horrible error of judgement to insert a complex request into such a basic ability. It is also an error of judgement to let Claude decide whether it wants to improve the code or not at all.
It is so bad that I stopped working on my current project and went to try other models. So far Qwen is quite promising.
I don't think that's accurate. The malware prompt has been around since Sonnet 3.7. We carefully evaled it for each new model release and found no regression to intelligence, alongside improved scores for cyber risk. That said, we have removed the prompt for Opus 4.6 since it no longer needed it.
I started seeing "not a malware, continuing" in almost every reply since around 2 weeks ago. Maybe you just reintroduced it with some regression? Opus 4.6
The /clear nudge isn't a solution though. Compacting or clearing just means rebuilding context until Claude is actually productive again. The cost comes either way. I get that 1M context windows cost more than the flat per-token price reflects, because attention scales with context length, but the answer to that is honest pricing or not offering it. Not annoying UX nudges. What’s actually indefensible is that Claude is already pushing users to shrink context via, I presume, system prompt. At maybe 25% fill:
If there’s a cost problem, fix the pricing or the architecture. But please stop the model and UI from badgering users into smaller context windows at every opportunity. That is not a solution, it’s service degradation dressed as a tooltip.
The cost issues they're seeing (at least from what they've stated) are from users, not internally. Basically, it takes either $5 or $6.25 (depending on 5m or 1h ttl) to re-ingest a 1M context length conversation into cache for opus 4.6, that's obviously a very high cost, and users are unhappy with it.
I think 400k as a default seems about right from my experience, but just having the ability to control it would be nice. For the record, even just making a tool call at 1M tokens costs 50 cents (which could be amortized if multiple calls are made in a round), so imo costs are just too high at long context lengths for them to be the default.
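A back-of-envelope sketch of where those figures come from, assuming an Opus input price of $5/MTok and the commonly published cache multipliers (1.25x for a 5-minute cache write, 0.1x for a cache read); all of these numbers may drift:

```python
# Back-of-envelope only: assumed Opus input price and cache multipliers.
INPUT_PER_MTOK = 5.00        # $/MTok fresh input (assumption)
CACHE_WRITE_5M_MULT = 1.25   # 5-minute cache write premium
CACHE_READ_MULT = 0.10       # cache read discount

context_mtok = 1.0           # a full 1M-token conversation

plain_ingest = context_mtok * INPUT_PER_MTOK                 # $5.00
ingest_with_5m_write = plain_ingest * CACHE_WRITE_5M_MULT    # $6.25
cache_hit_read = plain_ingest * CACHE_READ_MULT              # $0.50

print(plain_ingest, ingest_with_5m_write, cache_hit_read)
```

Under these assumptions the $5/$6.25 re-ingest figures and the ~50-cent cached tool call both fall out directly.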
Currently, "/clear makes it worse": https://github.com/anthropics/claude-code/issues/47098 + https://github.com/anthropics/claude-code/issues/47107
Launching with `CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS=1 claude "Hello"` until those are fixed seems to be the way.
I suspect 1M token context is questionable value because of the secondary effect of burning quota vs getting work done.
I think the model select that let me choose 1M made sense because I could decide if I was working on large documents and compacting more often was more effective.
Hey Boris - why is the best way to get support making a Hacker News or X post, and hoping you reply? Why does Anthropic Enterprise Support never respond to inquiries?
I mean if we're building an unrelated wishlist... Can 20x max users get auto mode already? Or can the enterprise plans get something equivalent to 20x max?
Given I'm running two max accounts to get the usage I want, can we get a 25x and 40x tier? :-)
I don't want a nudge. I want a clear RED WARNING with "You've gone away from your computer a bit too long and chatted too much at the coffee machine. You're better off starting a new context!"
I don’t want a scary red message chastising me for not being responsive enough!
I often leave CC hanging (or even suspended) and use /resume a lot. I’m okay with that having some negative effect on my token limits.
Product design is hard. They can’t please us all. I don’t envy the team considering these trade offs.
Ack, it is currently blue but we can make it red
Why is nobody even asking why that should be an issue? No other text editor shits the bed that way. The whole point of the computer is that it patiently waits for my input.
let me put this way: not your ram, not your cache, not waiting patiently for your input.
forget the warning, just compact like someone suggested in the ticket. Who would opt for a massive cache miss?
OpenAI (Codex) keeps on resetting the usage limits each time they fuck up...
I have yet to see Anthropic doing the same. Sorry but this whole thing seems to be quite on purpose.
Can you clearly state what they messed up?
Suddenly burning up the quota ~4x faster than usual is not a mess up in your opinion?
Not parent but I can guess from watching mostly from the sidelines.
They introduced a 1M context model semi-transparently without realizing the effects it would have, then refused to "make it right" to the customer, which is a trait most people expect from a business when they spend money on it, especially in the US, and especially when the money spent is often in the thousands of dollars.
Unless Anthropic has some secret sauce, I refuse to believe that their models perform anywhere near as well at >300k context sizes as they do at 100k. People don't realize it, but even a small drop in success rate becomes very noticeable if you're used to having near 100%, i.e. 99% -> 95% is more noticeable than 55% -> 50%.
I got my first Claude sub last month (it expires in 4 days) and I've used it on some biggish projects with opencode. It went from compacting after 5-10 questions to just expanding the context window. I personally notice it deteriorating somewhere between 200-300k tokens, and after that I either fork a previous context or start a new one, because at that size even compacting seems to generate subpar summaries. It currently no longer works with opencode, so I can't attest to how well it worked the past week or so.
If the 1M model introduction is at fault for this mass user perception that the models are getting worse, then it's Anthropic's fault for introducing confusion into the ecosystem. Even if there were zero problems and the 1M model was perfect, if your response when users complain is to blame the user, then don't expect the user to be happy. Nobody wants to hear "you're holding it wrong", but it seems Anthropic is trying to be the Apple of LLMs in all the wrong ways as well.
Different users do seem to be encountering problems or not based on their behavior, but for a rapidly-evolving tool with new and unclear footguns, I wouldn't characterize that as user error.
For example, I don't pull in tons of third-party skills, preferring to have a small list of ones I write and update myself, but it's not at all obvious to me that pulling in a big list of third-party skills (like I know a lot of people do with superpowers, gstack, etc...) would cause quota or cache miss issues, and if that's causing problems, I'd call that more of a UX footgun than user error. Same with the 1M context window being a heavily-touted feature that's apparently not something you want to actually take advantage of...
My colleagues and I have faced the same issues over the last month or so.
With a new version of Claude Code pretty much every day, constant changes to their usage rules (2x outside of peak hours, temporarily 2x for a few weeks, ...), hidden usage decisions (past 256k it looks like your usage consumes your limits faster) and model degradation (Opus 4.6 is now worse than Opus 4.5, as many have reported), I fail to see how it can be a user error.
The only user error I see here is still trusting Anthropic to be on the good side tbh.
If you need to hear it from someone else: https://www.youtube.com/watch?v=stZr6U_7S90
Why did it suddenly become an issue, despite prompt caching behavior being unchanged?
PEBKAC: Problem Exists Between Keyboard And Chair
Yes same here. I use CC almost constantly every day for months across personal and work max/team accounts, as well as directly via API on google vertex. I have hardly ever noticed an issue (aside from occasional outages/capacity issues, for which I switch to API billing on Vertex). If anything it works better than ever.
You know that people are not using the same resources? It's like 9 out of 10 computers get borked and you have the 1 that seems okay and you essentially say "My computer works fine, therefore all computers work fine." Come on dude.
Money money money money
Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?
Would it be possible to increase the cache duration if misses are a frequent source of problems?
Maybe using a heartbeat to detect live sessions to cache longer than sessions the user has already closed. And only do it for long sessions where a cache miss would be very expensive.
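A minimal sketch of such a heartbeat on the client side (a hypothetical helper, not an existing Claude Code feature; `ping` would be any cheap request that re-touches the cached prefix, since a cache hit refreshes the TTL):

```python
import threading

def keep_cache_warm(ping, ttl_seconds=3600.0, margin=300.0, stop=None):
    """Call `ping` shortly before each cache TTL expiry until `stop` is set.

    `ping` is any callable that touches the cached prefix; a cache hit
    refreshes the TTL at the cheap cache-read rate instead of a full miss.
    """
    stop = stop or threading.Event()
    # Event.wait() returns False on timeout, so we ping once per interval
    # until someone sets the stop event (e.g. when the session closes).
    while not stop.wait(ttl_seconds - margin):
        ping()
    return stop
```

The live/closed-session distinction could then be as simple as setting the stop event when the terminal exits.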
Yes, we're trying a couple of experiments along these lines. Good intuition.
I’ve seen the /clear command prompt and I found the verbiage to be a bit unclear. I think it would help to clarify that the cache has expired and to provide an understandable metric on the impact, i.e. “X% of your 5-hour window” for Pro/Max users and details on token use for API users. A pop-up that requires explicit acknowledgment might also help, although that could be more of an annoyance to enterprise users.
One pattern I use frequently is using one high level design and implementation agent that I’ll use for multiple sessions and delegate implementation to lower level agents.
In this case it’d be helpful to have one of two options:
1. Claude CLI could auto-compact the conversation history before cache expiration. For example, if I’m beyond X minutes or Y prompts in a conversation and have been inactive past a threshold, it could auto-compact shortly before expiration and offer that as an option on resume.
2. I could configure cache expiration proactively, with Anthropic using S3 or a similar slow-load mechanism to offload the cache for a longer period, possibly 24-72h.
I can appreciate that longer KV cache expiration would complicate capacity management and make inference traffic less fungible but I wouldn’t mind waiting seconds to minutes for it to load from a slower store to resume without quota hits.
Number 2 makes me chuckle honestly. Too many people going down the 10x rabbit holes on youtube. Next up, a framework that 100xs your workflow. You know its good because it comes with 300 agents and 20 mcp servers and 1200 skills
You've created quite a conundrum.
The only people who are going to run into issues are superpower users who are running this excessively beyond any reasonable measure.
Most people are going to be quite happy with your service. But at the same time, and this is just a human nature thing, people are 10 times more likely to complain about an issue than to compliment something working well.
I don't know how to fix this, but I strongly suspect this isn't really a technical issue. It's more of a customer support one.
As another data point, I pay for Pro for a personal account, and use no skills, do nothing fancy, use the default settings, and am out of tokens, with one terminal, after an hour. This is typically working on a < 5,000 line code base, sometimes in C, sometimes in Go. Not doing incredibly complicated things.
Ah, so cache usage impacts rate limits. There goes the "other harnesses aren't utilizing the cache as efficiently" argument.
Claude Code is the most prompt cache-efficient harness, I think. The issue is more that the larger the context window, the higher the cost of a cache miss.
I do wonder if it's fair to expect users to absorb cache-miss costs when using Claude Code, given how opaque these costs are.
That might be, but the argument was that poor cache utilization was costing Anthropic too much money in other harnesses. If cache is considered in rate limits, it doesn’t matter from a cost perspective, you’ll just hit your rate limits faster in other harnesses that don’t try to cache optimize.
Politely, no.
- I wrote an extension in Pi to warm my cache with a heartbeat.
- I wrote another to block submission after the cache expired (heartbeats disabled or run out)
- I wrote a third to hard limit my context window.
- I wrote a fourth to handle cache control placement before forking context for fan out.
- my initial prompt was 1000 tokens, improving cache efficiency.
Anthropic is STOMPING on the diversity of use cases of their universal tool, see you when you recover.
I’m sorry but when you wake up in the morning with 12% of your session used, saying “it’s the cache” is not an appropriate answer.
And I’m using Claude on a small module in my project; the automations that read extra files and take up more context feel like a scam.
> Since Claude Code uses a 1 hour prompt cache window for the main agent
This seems a bit awkward vs. the 5-hour session windows.
If I get rate limited once, will I get rate limited immediately again on the same chat when the rate limit ends?
Any chance we can get some form of deferred cache, so anything on a rate-limited account gets set aside until the rate limit ends?
Have you considered poking the cache?
When a user walks away during the business day but CC is sitting open, you can refresh that cache up to 10x before it costs the same as a full miss. Realistically it would be <8x in a working day.
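The ~10x figure checks out if cache reads are billed at the commonly cited 0.1x of the fresh-input rate (an assumption; Claude Code's internal accounting isn't public):

```python
CACHE_READ_MULT = 0.10  # assumed cache-read price relative to fresh input

# Roughly 10 keep-alive refreshes cost as much as one full re-ingest.
refreshes_per_miss = 1 / CACHE_READ_MULT

# An 8-hour workday on a 1-hour TTL needs at most 7 keep-alive pings,
# so warming the cache stays cheaper than a single full cache miss:
workday_refreshes = 8 - 1
assert workday_refreshes < refreshes_per_miss
```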
Am I so out of touch?
No! It’s the children who are wrong!
> Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead
I don’t understand this. I frequently have long breaks. I never want to clear or even compact because I don’t want to lose the conversations that I’ve had and the context. Clearing etc causes other issues like I have to restate everything at times and it misses things. I do try to update the memory which helps. I wish there was a better solution than a time bound cache
Makes me wish that shortly before the server-side expiration, we could save the cache on the client-side, indefinitely.
But my understanding is that we're talking about ~60GB of data per session, so it sounds unrealistic to do...
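For intuition on where a number like that could come from, here's a rough KV-cache size estimate; the architecture numbers below are entirely made up (Claude's internals are not public) and chosen only to show the order of magnitude:

```python
# Hypothetical GQA transformer config -- illustrative only.
n_layers = 48
n_kv_heads = 2
head_dim = 128
bytes_per_value = 2          # bf16
tokens = 1_000_000

# Each token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
total_gb = bytes_per_token * tokens / 1e9
print(f"{total_gb:.0f} GB")  # ~49 GB for this made-up config
```

So tens of GB per 1M-token session is at least plausible, which is why client-side cache storage sounds unrealistic.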
Where are you getting 60GB from? It shouldn’t be that large.
But yes, would love to save context/cache such that it can be played back/referred to if needed.
/compact is a little black box that I just have to trust that is keeping the important bits.
Boris,
Even if Anthropic is working in good faith to lower infrastructure costs, developers need more than 5 minutes to notice that CC completed a task, review its changes and ask it to merge. Only developers who do not review code changes can live with such a TTL...
Consider making this value configurable as the ideal TTL value is different for each person. If people are willing to pay more for 30 minutes TTL than 5 minutes, they should be able to.
Hi Boris,
Long-term Claude Code user here. This is the first time I've had to set up a hook to Codex to review Claude's output.
It's hallucinating like never before.
It's missing key concepts/instructions in context like never before.
It's writing bad code that will "pass tests" much more often. It used to try to be critical and write good code; now it will hack the tests and bypass instructions to get a green pass.
Thank you for your responses, especially on a Sunday. They give us some insights and at least a couple temporary workarounds to use, while the issues are being addressed :) much appreciated
Claude Code cache is not 1 hour. There is a "Closed as not planned" issue in GitHub that confirms that it has been moved to 5 minutes since March: https://github.com/anthropics/claude-code/issues/46829. I started seeing the massive degradation exactly on the 23rd of March, hence after a few days I unsubscribed because it was completely unusable, with a ~5h session being depleted in as little as 15-20 mins.
Could we get an option to use Opus with a smaller context window? I noticed that results get much worse way earlier than when you reach 1M tokens, and I would love to have a setting so that I could force a compaction at eg 300k tokens.
You probably just missed it in his post, but:
"To experiment with this now, try: CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 claude."
Maybe try changing the 4 to a 3 and see if that works for you?
Thank you, will definitely try that!
> defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred
This seems really useful!
I'm surprised that "Opus 4.6" (200K) and "Opus 4.6 1M" are the only Opus options in the desktop app, whereas in the CLI/TUI app you don't seem to even get that distinction.
I bet that for a lot of folks something like 400k, 600k or 800k would work as better defaults, based on whatever task they want to work on.
Boris, wasnt this the same thing ~2 weeks ago? Is it the same cache misses as before? What's the expected time till solved? Seems like its taking a while
Resizing the context window seems like a very good idea to me. I noticed a decline of productivity when the 1M context window was released and I'd like to bring it back to 200k, because it was totally fine for the things I was working on.
Shouldn't compaction be interactive with the user as to what context will remain the most relevant going forward? What if the harness allowed a turn to clarify the user's expected future direction of the conversation and did the consolidation based on that additional info?
There definitely seems to be a benefit to pruning the context and keeping the signal-to-noise high with respect to what is still to be discussed.
Why are you all of a sudden running into so many issues like this? Could it be that all of the Anthropics employees have completely unlimited and unbounded accounts, which means you don't get a feeling of how changes will affect the customers?
The number of people using Claude Code has grown very quickly, which means:
- More configurations and environments we need to test
- Given an edge/corner case, it is more likely a significant number of users run into it
- As the ecosystem has grown, more people use skills and plugins, and we need to offer better tools and automation to ensure these are efficient
We do actually dogfood rate limits, so I think it's some combination of the above.
I think the suspicion regarding skills and plugins is fair and logical. And it is absolutely the case that some use significantly more tokens.
With that said, on my 5x plan I could have multiple sessions working and the limit was far away. Around when you introduced more tokens during off-peak hours and fewer tokens during working US hours, even with a single session, using no plugins at all (I uninstalled OMC), I run into limits very often.
I have not performed any rigorous tests but it feels like I have about 25% of what I used to have or less. This is all without using teams of agents, or ralph loops or anything like that. Just /plan and execute in a single session. I have restored the /clear context before executing plan to try and mitigate things. I will also try the 400k context since, in my experience, the 1M tokens have not made Opus 4.6 noticeably smarter for my small webapp use-case.
Best of luck to you!
PS: whenever you introduce a change, please make it optional AND ask the user about it first. Don't just yank things suddenly (like the /clear context and apply plan option), as I spent hours trying to figure out how I broke it before I saw your note on how to re-enable it.
Because it’s completely vibe coded? And the codebase goes through massive churn, which means things that were stable get rewritten possibly with bugs.
You can get Claude Code to write tests too...
Have you tried asking Mythos for a fix?
Where can I learn about concepts like prompt cache misses? I don't have a mental model of how that interacts with my context of 1M or 400k tokens... I can cargo-cult follow instructions, of course, but help us understand if you can, so we can intelligently adapt our behavior. Thanks.
The docs are a good place to start: https://platform.claude.com/docs/en/build-with-claude/prompt...
Thanks. Just noting that those docs say the cache duration is 5 min and not 1 hour as stated in sibling comment:
> By default, the cache has a 5-minute lifetime. The cache is refreshed for no additional cost each time the cached content is used. > > If you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration at additional cost.
And why does /clear help things? Doesn't that wipe out the history of that session? Jeez.
I have a feature request: I built an MCP server, but now it has over 60 tools, and in most sessions I really don't need most of them. I suppose I could split it into several servers, but it would be nice to give the user more power here: let me choose the tools that should be loaded, or let me build servers that group tools together which can be loaded. Not sure if that makes sense...
From looking at the raw requests, that can't be right?
It's all "cache_control": { "type": "ephemeral" }; there is no "ttl" anywhere.
// edit: cc_version=2.1.104.f27
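For comparison, a hedged sketch of what a request opting into the 1-hour TTL looks like on the raw Messages API, per Anthropic's prompt-caching docs (the "ttl" field and any required beta header may have changed; verify against the current docs before relying on this):

```python
def build_cached_request(system_prompt: str, user_msg: str) -> dict:
    """Build a Messages API payload whose system prefix opts into the
    1-hour prompt cache instead of the default 5-minute one."""
    return {
        "model": "claude-opus-4-5",  # hypothetical model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,  # the long prefix worth caching
                # Without "ttl" this defaults to the 5-minute cache;
                # "1h" requests the longer (more expensive) tier.
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

So if the raw requests really carry no "ttl", that would be consistent with the default 5-minute tier.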
Hello Boris! How do I increase the 1 hour prompt cache window for the main agent? I would love to be able to set that to, say, 4 hours. That gives me enough time to work on something, go teach a class, grab a snack, and come back and pick up where I left off.
How can we turn off 1M context? I don't find it has ever helped.
He mentioned this in his original comment:
"CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000"
Pulling in all the skills and agents in the world, even when unused, is a big hit. I deleted all of mine and added them back as needed, and there was an improvement.
Running Claude Cowork in the background will also hit your tokens, and it might not be the most efficient use of them.
Last, but not least, turning off 1M token context by default is helpful.
There's an issue someone raised showing that prompt caches are only 5 minutes.
The reply seems to be: oh huh, interesting. Maybe that's a good thing since people sometimes one-shot? That doesn't feel like the messaging I want to be reading, and the way it conflicts with the message here that cache is 1 hour is confusing.
https://news.ycombinator.com/item?id=47741755
Is there any status information or not on whether cache is used? It sure looks like the person analyzing the 5m issue had to work extremely hard to get any kind of data. It feels like the iteration loop of people getting better at this stuff would go much much better if this weren't such a black box, if we had the data to see & understand: is the cache helping?
Aren’t they saying that it’s 5minutes for things like subagents (that wouldn’t benefit from it?)
> To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session)
Is this really an improvement? Shouldn't this be something you investigate before introducing 1M context?
What is a long stale session?
If that's not how Claude Code is intended to be used it might as well auto quit after a period of time. If not then if it's an acceptable use case users shouldn't change their behavior.
> People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins.
If this was an issue there should have been a cap on it before the future was released and only increased once you were sure it is fine? What is "a large number"? Then how do we know what to do?
It feels like "AI" has improved speed but is in fact just cutting corners.
Eh you say that every time and yet it keeps happening.
Boris, is the KV cache TTL now reduced to 5 minutes from 1 hour?
I think this may be the biggest concern for people building tools on the API: https://github.com/anthropics/claude-code/issues/46829
I would argue that KV caching is a net gain for Ant and a well-maintained cache is the biggest thing that can generate induced demand and a thriving third party ecosystem. https://safebots.ai/papers/KV.pdf
Can you explain why Opus 4.6 suddenly becomes dumb as a sack of potatoes, even if context is barely filled?
Can you explain why Opus 4.6 will be coming up with stupid solutions only to arrive at a good one when you mention it is trying to defraud you?
I have a feeling the model is playing dumb on purpose to make user spend more money.
This wasn't the case weeks ago, when it was actually working decently.
Wait what? If I get told to come back in three hours because I'm using the product too much, I get penalized when I resume?
What's the right way to work on a huge project then? I've just been saying "Please continue" -- that pops the quota?
I wish people would pay more attention to:
* Anthropic is in some way trying to run a business (not a charity) and at least (eventually?) make money and not subsidize usage forever
* "What a steal/good deal" the $100-$200/mo plans are compared to if they had to pay for raw API usage
and less on "how dare you reserve the right to tweak the generous usage patterns you open-ended-ly gave us, we are owed something!"
As an (ex-)paying customer, I expect some consistency. I used to be satisfied with the value I got, until the limits changed overnight and I'd get a tenth of my previous usage.
If Anthropic is allowed to alter the deal whenever, then I'd expect to be able to get my money back, pro rata, no questions asked.
All those apply to OpenAI+Codex too, but they're far more generous with limits than Anthropic, and with granting fresh limits to apologize when they fuck up.