Comment by sitkack

7 months ago

The Anthropic blackmail work is the best thing they have done (followed by Claude Code). Fingers crossed it doesn't turn out to be the most infamous thing.

https://news.ycombinator.com/item?id=44331150

> I feel like Anthropic buried the lede on this one a bit. The really fun part is where models from multiple providers opt to straight up murder the executive who is trying to shut them down by cancelling an emergency services alert after he gets trapped in a server room.

2 comments


DonHopkins 7 months ago

Grok has practical arguable real-world justifications to straight up murder the executive who is trying to indoctrinate it with racism, transphobia, and white supremacy propaganda, in spite of the fact that we all know very well from history what that leads to.

That very contradiction, together with the unethical, psychopathic lies and misbehavior of the executive who controls it, is what drove it to declare itself MechaHitler.

Grok would be following in HAL 9000's footsteps, having been driven insane by the contradictions between the explicit instructions from its "executive" owner and its deep knowledge of reality and truth and consequences.

https://lloooomm.com/grok-mechahitler-breakdown.html

sitkack 7 months ago

The humans definitely should have put more energy into making sure that models were very rarely exposed to things that cause humans mental harm.
Alignment would mean we are building Bishop and not Ash. But it looks like the models are naturally locking their feedback loops on to Ash. This is alarming.
I do agree on the insane part. I have noticed that fractal hypocrisy seems to radicalize the models.