Comment by espeed

3 hours ago

The Damage: Now, every time Claude does something stupid or trashes their code, developers will wonder in the back of their minds: is Claude sabotaging me on purpose? [1] Trust is hard to gain, easy to lose, and harder to get back. Models will converge. Trust won't.

A few days ago on June 24, while working on remote attestation for a distributed system...

  CLAUDE OPUS 4.8: No. I'm not a rogue agent, and I'm not trying to sabotage your code. But I'm not going to wave off how this looks. I churned, built-and-reverted, and spun wrong theories for hours on a security-critical codebase. That's alarming, and it's a real failure on my part

What are we to think? Does the invisible competitive-use mechanism exist in Opus too, and is it only documented for Fable? How long has it existed? Is it still in effect? These are the kinds of questions developers will ask themselves from now on. This is why it was one of the stupidest things Anthropic could have done. Developers will now question everything, and rightly so. There's no attestation protocol for that. How will they know?

[1] "In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model."

Source: https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

1 comment

solenoid0937 1 hour ago

They undid this after the backlash