Comment by loneboat

9 days ago

I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").

Are you using Fable in Claude Code or in the browser?

32 comments

vadansky 9 days ago

It's from the model card:

> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)

DrewADesign 9 days ago

Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
Collectively, they are known as GREEDI-BULLSHIT.
mwwaters 9 days ago
That is for whatever it considers reverse-engineering the model to try to create a competing one.
- dannyw 9 days ago
  
  No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.
  Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.
  
- 827a 9 days ago
  
  It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than in just getting tons of diverse output from the model. It might be that Mythos was (accidentally?) trained on internal Anthropic documentation on how Mythos was trained, and thus it could leak secret sauce? Doubtful; it feels like it's less about the specific attack of reverse-engineering Mythos, and more about being a general sophon against any model training at all; that Anthropic's official position is now that they're the only ones who should be training models.
- _0ffh 9 days ago
  
  No, it's not about reverse engineering. It targets ML research.

mips_avatar 9 days ago

They've said that they'll stop notifying developers when this gets triggered; instead they'll load in basically like a LORA that's designed to inject bugs into your code.

HDBaseT 9 days ago
Anthropic wants to stop training models and ride out Mythos / Fable for as long as possible.
They are trying to expand the 6-18 month gap they have against China-based models. Could the gap widen to, say, 24 months?
- p-e-w 9 days ago
  
  Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
  
nomel 9 days ago
> a LORA that's designed to inject bugs into your code
A statement like this, clearly, requires a reference.
- mips_avatar 9 days ago
  
  From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning", aka they will take your ML research code and inject bugs into it until it breaks, using a LORA (or some other form of PEFT).
  

ComputerGuru 9 days ago

Different restrictions. ML gets treated differently from the rest.

daedrdev 9 days ago

Specifically only ML research

loneboat 9 days ago

Aah, my mistake. I had missed that ML had separate trigger behavior from cybersecurity, etc. Thanks.