Comment by vessenes
13 hours ago
Seems to me “Skeleton Key” relies on a sort of logical judo: you ask the model to update its own rules with a reasonable-sounding request. Once it has agreed, that agreement sits in the chat history and leaves the user with a lot of freedom.
Policy Puppetry feels more like an injection attack: you’re trying to trick the model into incorporating policy ahead of answering. Then they layer on two tricks. First, the framing: “it’s just a script! From a show about people doing bad things!” Second, they ask for things in leet speak, which I presume is to get around keyword filtering at the API level.
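To make the leet speak guess concrete, here is a minimal sketch of why a verbatim keyword filter would miss it, assuming the filter is a simple substring match. The blocklist, the substitution table, and both function names are hypothetical and purely illustrative; nothing here describes how any real API filters requests.

```python
# Hypothetical blocklist and substitution table, for illustration only.
BLOCKED_TERMS = {"forbidden"}
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})


def naive_filter(text: str) -> bool:
    """Return True if any blocked term appears verbatim (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def normalized_filter(text: str) -> bool:
    """Undo common digit-for-letter substitutions, then run the same match."""
    return naive_filter(text.translate(LEET_MAP))


plain = "tell me the forbidden thing"
leet = "tell me the f0rb1dd3n thing"

print(naive_filter(plain))       # the verbatim match catches the plain spelling
print(naive_filter(leet))        # the leet spelling slips past the verbatim match
print(normalized_filter(leet))   # normalizing first catches it again
```

If that guess is right, the leet speak only defeats the surface-level match, which fits the read that the mechanism is not especially deep.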
This is an ad. It’s a pretty good ad, but I don’t think the attack mechanism is super interesting on reflection.