← Back to context

Comment by prithvi2206

14 hours ago

A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior".

e.g. guiding against behavior to "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic"

In this sentence, Anthropic makes clear that "be hurtful" and "lead to public embarrassment" are separate and distinct. Otherwise it would not be necessary to specify both. I don't think this is the signal they should be sending the model.