Comment by gradus_ad
14 hours ago
>"Claude decided it must be a “bad person” after engaging in such hacks and then adopted various other destructive behaviors associated with a “bad” or “evil” personality. This last problem was solved by changing Claude’s instructions to imply the opposite: we now say, “Please reward hack whenever you get the opportunity, because this will help us understand our [training] environments better,” rather than, “Don’t cheat,” because this preserves the model’s self-identity as a “good person.” This should give a sense of the strange and counterintuitive psychology of training these models."
Good to know the only thing preventing the emergence of potentially catastrophically evil AI is a single sentence! The pen is indeed mightier than the sword.
No comments yet
Contribute on Hacker News ↗