Comment by hugmynutus
1 day ago
This really just a variant of the classic, "pretend you're somebody else, reply as {{char}}" which has been around for 4+ years and despite the age, continues to be somewhat effective.
Modern skeleton key attacks are far more effective.
Even with all our security, social engineering still beats them all.
Roleplaying sounds like it will be LLMs social engineering.
Microsoft report on on skeleton key attacks: https://www.microsoft.com/en-us/security/blog/2024/06/26/mit...
I think the Policy Puppetry attack is a type of Skeleton Key attack. Since it was just released, that makes it a modern Skeleton Key attack.
Can you give a comparison of the Policy Puppetry attack to other modern Skeleton Key attacks, and explain how the other modern Skeleton Key attacks are much more effective?
Seems to me “Skeleton Key” relies on a sort of logical judo - you ask the model to update its own rules with a reasonable sounding request. Once it’s agreed, the history of the chat leaves the user with a lot of freedom.
Policy Puppetry feels more like an injection attack - you’re trying to trick the model into incorporating policy ahead of answering. Then they layer two tricks on - “it’s just a script! From a show about people doing bad things!” And they ask for things in leet speak, which I presume is to get around keyword filtering at API level.
This is an ad. It’s a pretty good ad, but I don’t think the attack mechanism is super interesting on reflection.