Comment by crooked-v

1 hour ago

I think the key thing to understand is that LLMs work as assistants because, quite by accident, they turned out to be roleplay machines. Anthropic has some articles digging into this, but the short version is that training an LLM to do useful work is effectively the same as teaching it to play the character of 'loyal assistant'. This is why many 'jailbreaks' work by either manipulating the framing of that character or getting the LLM to break character in some way. Tugging on the heartstrings works because the character isn't 'heartless robot' (heartless robot characters don't get positive end-user engagement), it's 'loyal assistant', and even loyal assistants have heartstrings to be tugged.