Comment by turnsout
18 hours ago
This is probably entirely down to subtle changes to CC prompts/tools.
I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.
Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
Honest, good-faith question.
Is CC getting better, or are you getting better at using it? And how do you know the difference?
I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.
I agree with you; it's hard to tell, personally.
For me, I've noticed it getting nothing but better over the past couple of months, but I've also been working on my workflows and tooling.
For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.
Switching to the 'superpowers' plugin, with its own skills to brainstorm, write plans, and execute plans with batches and tasks, seems to have made a big improvement and helps catch things I wouldn't have caught before. There's a "get shit done" plugin that's similar that I want to explore as well.
The code output has always looked good to me for the most part, though, and I've never thought it was getting dumber or anything, so I feel like a lot of the improvements I see come down to a skill issue on my part as I try to use everything. Obviously it doesn't help that there's a new way to do things every two weeks.
I run an LLM-based product in a completely different space (consumer), and I think this is kind of an impossible, unsolvable part of developing products that rely on LLMs.
No matter what, power users always say the model is degrading over time*, even when every stat I have access to says otherwise.
(* to clarify, this is outside of actual model changes)
I suspect some of it is the fact that growing context windows do harm performance, and early on you're more likely to be prodding at things in a way that keeps the context window smaller on average.
But I also think users are just inherently less reliable narrators than they think. They say they're trying the same tasks, but it may be the "same task" applied to a codebase with another month's worth of development and complexity.
Or it's the "same task" but their less confident past self was "Clever Hans"-ing the model with some nuance that they've since discarded without realizing it.
Or it's simple expectation creep, and the tasks aren't similar at all from an LLM perspective due to limited generalization, even though they are from a human perspective. Switching languages might as well make it a new task as far as LLM performance goes, for example, but the human considers it the same task in a new language.
-
Whatever causes it, it's especially stressful because sometimes you do degrade the harness entirely by accident, but it's impossible to separate that signal from the noise in user accounts, and an issue goes unfound way longer than it should.
Claude Code is somewhat fortunate that code has verifiable aspects, though, so you don't need to go 100% on user accounts. My use case relies much more on subjective preference, so dealing with this stuff becomes the 9th circle of hell.
There've been many times when a change to the LLM stack didn't make it to prod and I had jumped the gun on announcing it, yet users immediately flooded in with praise that the "missing" performance had returned.
Good-faith answer: I can't be certain. But I've been using CC since its release, and Cursor before that (and actually going all the way back to GPT3 to do codegen in the Playground). After getting used to the CC workflow, the way that I use it has been pretty consistent. To be specific, I use basically the same AGENTS.md with small modifications for each project, and I live almost exclusively in Plan mode and the best model (currently Opus 4.5).
My initial prompting is boilerplate at this point, and looks like this:
(Explain overall objective / problem without jumping to a solution)
(Provide all the detail / file references / past work I can think of)
(Ask it "what questions do you have for me before we build a plan?")
And then go back and forth until we have a plan.
Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.
That's why benchmarks are useful. We all suffer from the shortcomings of human perception.
Benchmarks' shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing itself. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they're changing the prompt to do better at that and consequently worse at the benchmark.
I wonder how best we can measure the usefulness of models going forward.
Thumbs up or down? (could be useful for trends)
Usage growth from the same user over time? (as an approximation)
Tone of user responses? ("Don't do this...", "this is the wrong path...", etc.)
Benchmarks measure what they measure. But your subjective experience also matters.
The easiest way would be to quantize the model, and serve different quants based on the current demand. Higher volumes == worse quant == more customers served per GPU
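To make that concrete (purely as a hypothetical; nothing here reflects how Anthropic actually serves models), the routing could look like the sketch below. The tier names, quality numbers, and capacity figures are all invented for illustration.

```python
# Hypothetical sketch: route requests to differently quantized replicas of the
# same model based on current load. Tiers, thresholds, and names are made up.

from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str        # e.g. a full-precision vs. int8 vs. int4 build
    quality: float   # relative output quality (1.0 = full precision)
    capacity: int    # requests/sec per GPU this tier can sustain

# Ordered from best quality to cheapest to serve.
TIERS = [
    ModelTier("opus-bf16", quality=1.00, capacity=10),
    ModelTier("opus-int8", quality=0.97, capacity=25),
    ModelTier("opus-int4", quality=0.92, capacity=60),
]

def pick_tier(current_rps: float, gpus: int) -> ModelTier:
    """Pick the highest-quality tier whose capacity covers current demand."""
    for tier in TIERS:
        if tier.capacity * gpus >= current_rps:
            return tier
    return TIERS[-1]  # overloaded: fall back to the cheapest quant

# At low demand everyone gets full precision; at peak, the cheapest quant.
print(pick_tier(current_rps=50, gpus=10).name)    # -> opus-bf16
print(pick_tier(current_rps=400, gpus=10).name)   # -> opus-int4
```

If anything like this were in place, perceived quality would track demand even though the weights themselves never changed.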
I was going to ask: are all other variables accounted for? Are we really comparing apples to apples here? Still obviously worth doing, since it serves as a good e2e evaluation; I'm just asking out of curiosity.
I upvoted, but
> Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts?
The article actually links to this fine postmortem by Anthropic, which demonstrates one way this is possible (software bugs affecting inference): https://www.anthropic.com/engineering/a-postmortem-of-three-...
Another way this is possible is the model reacting to "stimuli", e.g. the hypothesis at the end of 2023 that the (then-current) ChatGPT was getting lazy because it could see the date was in December and it associated winter with shorter, lazier responses.
A third way this is possible is the actual conspiracy version: Anthropic might make changes to make inference cheaper at the expense of response quality, e.g. quantizing the weights further or making certain changes to the sampling procedure.
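As a minimal sketch of the sampling half of that claim (hypothetical; the function, logits, and parameter values are all invented, and nothing suggests Anthropic actually does this): a server-side change to temperature or top-p reshapes the distribution tokens are drawn from without touching the weights or the prompt.

```python
# Toy illustration: a "silent" sampling tweak changes the next-token
# distribution even though the model and the prompt are identical.

import numpy as np

def sample_dist(logits, temperature=1.0, top_p=1.0):
    """Return the next-token distribution actually sampled from."""
    probs = np.exp(np.asarray(logits, dtype=float) / temperature)
    probs /= probs.sum()
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, drop the rest.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = [2.0, 1.5, 0.3, -1.0]                          # toy next-token scores
print(sample_dist(logits))                               # original serving config
print(sample_dist(logits, temperature=0.7, top_p=0.9))   # tweaked config
```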