← Back to context

Comment by freakynit

7 hours ago

People do realize there's a non-zero chance that Anthropic could have embedded some kind of hidden "backdoor" trigger in its training process, right?

For example, a specific seed phrase that, when placed at the beginning of a prompt, effectively disables or bypasses safety guardrails.

If something like that existed, it wouldn't be impossible to uncover:

1. A government agency (DoD/DoW/etc.) could discover the trigger through systematic experimentation and large-scale probing.

2. An Anthropic employee with knowledge of such a mechanism could be pressured or blackmailed into revealing it.

3. Company infrastructure could be compromised, allowing internal documentation or model details to be exfiltrated.

Any of these scenarios would give Anthropic plausible deniability... they could "publicly" claim they never removed safeguards (or agreed to DoD/DoW demands), while in practice a select party had a way around them (may be even assisted from within).

I'm not saying this "is" happening... but only that in a high-stakes standoff such as this, it's naive to assume technical guardrails are necessarily immutable or that no hidden override mechanisms could exist.

...indeed, it's possible (perhaps inevitable) that at some point, someone will invent/deploy/promote AI killing people.

We can't possibly keep that genie in that bottle.

But what we can do is achieve consensus that states, and their weapons of mass destruction, and their childish monetary systems, and their eternally broken promises... are not in keeping with the next phase of humanity.