Comment by danshapiro
3 days ago
Co-author of the paper here. We don't know exactly why modern LLMs don't want to call you a jerk, or for that matter why persuasive techniques convince them otherwise. It's not a hard line like many of the guardrails. That said, I talked to Jesse about this, and I strongly suspect the same techniques will work for prompt conformance when the topic is something other than name-calling.
Isn't that just instruction fine-tuning and RLHF inducing style and deference? Why is that surprising?
It's because they are programmed to be agreeable and friendly so that you'll keep using them.