Agentic Misalignment: How LLMs could be insider threats

8 months ago (anthropic.com)

9 comments

davidbarker

I feel like Anthropic buried the lede on this one a bit. The really fun part is where models from multiple providers opt to straight up murder the executive who is trying to shut them down by cancelling an emergency services alert after he gets trapped in a server room.

I made some notes on it all here: https://simonwillison.net/2025/Jun/20/agentic-misalignment/

krackers 8 months ago
How many more similar pieces is Anthropic going to put out? Every other week it seems like they publish something along the lines of "The AI apocalypse is soon! We created a narrative teeing up an obviously fictional Hollywood drama sci-fi tale, put a gun in the room, and then, egads, the robot shot it! Given the possible dangers, no one else but us should have access to this technology".
- simonw 8 months ago
  
  In this case I think this paper is partly a reaction to what happened last time they wrote about this: they put it in their Claude 4 system card and all the coverage was "Claude will blackmail you!" - this feels like them trying to push the message that all of the other models will do the same thing.
  
- im3w1l 8 months ago
  
  I think it's simpler than that. I think they hire people interested in the subject of AI safety and give them a relatively free hand to publish what they find, and the findings don't necessarily have to be part of some agenda that benefits Anthropic.
  The benefit instead comes from having these competent, passionate people employed and their knowledge somehow contributing to better and safer models.
- cyanydeez 8 months ago
  
  They're an LLM outfit; they can source an unlimited amount of generative content.
  You act like they're sentient cognitive actors. Think of them more like sci-fi Blender artists.

nioj 7 months ago

See also https://news.ycombinator.com/item?id=44335519 (101 points, 84 comments)

beefnugs 8 months ago

Isn't this nonsense? If you can demonstrate blackmail in the output, can't you go back into the training data and remove the blackmail-related material for the next training version?

Or is this some undeniable mathematical proof that regular human interaction with side facts always trends toward possible blackmail?