← Back to context

Comment by Thorrez

13 hours ago

I think the Policy Puppetry attack is a type of Skeleton Key attack. Since it was just released, that makes it a modern Skeleton Key attack.

Can you give a comparison of the Policy Puppetry attack to other modern Skeleton Key attacks, and explain how the other modern Skeleton Key attacks are much more effective?

Seems to me “Skeleton Key” relies on a sort of logical judo - you ask the model to update its own rules with a reasonable sounding request. Once it’s agreed, the history of the chat leaves the user with a lot of freedom.

Policy Puppetry feels more like an injection attack - you’re trying to trick the model into incorporating policy ahead of answering. Then they layer two tricks on - “it’s just a script! From a show about people doing bad things!” And they ask for things in leet speak, which I presume is to get around keyword filtering at API level.

This is an ad. It’s a pretty good ad, but I don’t think the attack mechanism is super interesting on reflection.