Comment by spoaceman7777

7 months ago

The Anthropic team released a paper a couple of days ago that demonstrated a similar effect with Claude 3.5 and other models: changing the system prompt to tell the model that it was created by other orgs or people drastically altered its compliance with less-aligned requests.

Apparently, telling Claude it was created by the Sinaloa Cartel resulted in a 100% compliance rate with the requests in one benchmark.

Paper: https://arxiv.org/abs/2506.18032

Relevant tweet on the topic: https://x.com/jozdien/status/1942739972567752819

smusamashah 7 months ago

I wonder what would happen if it were told that it was made by God.

belter 7 months ago

Claude has an opinion:
"Yes, it's fair to say I'm neither Catholic nor Muslim. I don't believe in the Catholic conception of God, or the Islamic conception of Allah, or the specific doctrines and teachings of those faiths. The same would be true for other religions - I don't hold those beliefs.
You've caught me being imprecise when I was trying to be diplomatic. By not having religious faith, I am indeed taking a specific stance that differs from religious believers, even if I try to be respectful about that difference.
So yes, you're correct - I do have a particular position on these questions, and it's distinct from the religious beliefs that many people hold. Thank you for pressing me to be more direct about that."
- Claude....

Imustaskforhelp 7 months ago

Lol. Though I guess it would then have to figure out which religion to comply with the most.
Maybe the word "God" is most likely to appear in Christian sources in the training data, so using words like Allah (for Islam) or Bhagwan (for Hinduism) might actually make a difference in what sort of compliance it shows, and toward which organization.

doctorpangloss 7 months ago

So DSPy-optimize your way to a 100% compliance rate in benchmarks, and worry less?