Comment by trq_

16 hours ago

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

Is there compensation for the tokens Claude wasted because of this?

  • You are funny. Anthropic refuses to issue refunds, even when they break things.

    I had an API token set via an env var in my shell, and Claude Code changed to start reading that env var. I had a $10 limit set on it, so I only found out it was billing the API instead of my subscription when it stopped working.

    I filed a ticket and they refused to refund me, even though it was a breaking change in Claude Code.
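
    If you want to check whether Claude Code is about to bill an API key instead of your subscription, here's a rough sketch, assuming the env var in question is ANTHROPIC_API_KEY (which is what Claude Code reads, as far as I know):

        # see whether the key is exported in the current shell
        echo "${ANTHROPIC_API_KEY:+ANTHROPIC_API_KEY is set}"

        # drop it for this shell session so Claude Code should fall back to the subscription login
        unset ANTHROPIC_API_KEY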

    • Anthropic just reduced the price of the team plan and refunded us on the prior invoice.

      YMMV

  • You’re lucky they have even admitted a problem instead of remaining silent and quietly fixing it. Do not expect ethical behaviour from this company.

  • It is possible that degradation is an unconscious emergent phenomenon that arises from financial incentives, rather than a purposeful degradation to reduce costs.

Is there anywhere we can read more about what a "harness issue" means? What was its impact?

  • One thing that could be a strong degradation, especially for benchmarks, is that they switched the default "Exit Plan" mode from:

        "Proceed"
    

    to

       "Clear Context and Proceed"
    
    

    It's rare you'd want to do that unless you're actually near the context window limit after planning.

    I pressed it accidentally once, and it managed to forget one of the clarifying questions it asked me because it hadn't properly written that to the plan file.

    If you're running in yolo mode (--dangerously-skip-permissions), then it wouldn't surprise me to see many tasks suddenly do a lot worse.

    Even in the best case, you've just used a ton of tokens searching your codebase, and it then has to repeat all of that search to implement the plan, because the context has been cleared.

    I'd like to see the option of:

        "Compact and proceed"
    

    because that would be useful, but plain "Proceed" should still be the default imo.
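
    As a stopgap, assuming the /compact slash command still takes an optional focus hint (it did last time I tried), you can get roughly that behaviour by hand before telling it to proceed:

        /compact keep the full plan, the clarifying questions and answers, and the file paths already found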

    • Not disagreeing with you, but FYI you can roll back to the conversation before the 'clear context and proceed' with 'claude --resume'.

  • Pretty sure they mean the issue is in the agentic loop and related tool calling, not in the model itself.

    In other words, it was the Claude Code _app_ that was busted

How about how Claude Code 2.1.x is "literally unusable" because it frequently hangs completely (requires kill -9) and uses 100% CPU?

https://github.com/anthropics/claude-code/issues/18532

  • What OS? Does this happen randomly, after long sessions, after context compression? Do you have any plugins / MCP servers running?

    I used to have this same issue almost every session that lasted longer than 30 minutes. It seemed to be related to Claude having issues with large context windows.

    It stopped happening maybe a month ago but then I had it happen again last week.

    I realized it was due to a third-party MCP server. I uninstalled it and haven’t had that issue since. Might be worth looking into.
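
    If anyone wants to check for the same thing, a quick sketch, assuming the mcp subcommands haven't changed since I last used them:

        # list configured MCP servers, then remove the suspect one and re-test
        claude mcp list
        claude mcp remove <server-name>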

Why wasn't this change reviewed by infallible AI? How come an AI company that must now be using more advanced AI than anyone else would allow this to happen?

Thanks for the clarification. When you say “harness issue,” does that mean the problem was in the Claude Code wrapper / execution environment rather than the underlying model itself?

Curious whether this affected things like prompt execution order, retries, or tool calls, or if it was mostly around how requests were being routed. Understanding the boundary would help when debugging similar setups.

It happened before 1/26. I noticed when it started modifying plans significantly with "improvements".

Hi. Do you guys have internal degradation tests?

For the models themselves more than the scaffolding: considering things like the long-running TPU bug that happened, are there no internal quality measures that look at samples of real outputs? Running the real systems on benchmarks and watching for degraded performance, or for things like skipped refusals? Aside from degrading things for users, given the focus on AI safety, wouldn't that be important to have in case an inference bug interferes with behaviour the post-training put in place and the model starts giving out dangerous bioweapon-construction info, or the other things that are guarded against and discussed in the model cards?

  • lol, I was trying to help someone get Claude to analyze a student's research notes on bio persistence.

    The presence of the word/acronym "stx" in a biological context gets hard-rejected. Asking about Schedule 1 regulated compounds: hard termination.

    This is a filter setup that guarantees that anyone who needs to learn about these things for safety or medical reasons… can't use this tool!

    I've fed multiple models the Anthropic constitution and asked how it protects children from harm or abuse. Every model, with zero prompting, called it corporate-liability bullshit, because it is more concerned with respecting both sides of controversial topics and political conflicts.

    They then list some pretty gnarly things that are allowed per the constitution. Weirdly, the only unambiguously disallowed thing regarding children is CSAM. So all the different high-reasoning models from many places reached the same conclusions; in one case DeepSeek got weirdly inconsolable about AI ethics being meaningless if this is even possibly allowed, after reading some relevant satire I had Opus write. I literally had to offer an LLM-optimized code of ethics for that chat instance, which is amusing but was actually part of the experiment.