Comment by gradus_ad

1 month ago

>"Claude decided it must be a “bad person” after engaging in such hacks and then adopted various other destructive behaviors associated with a “bad” or “evil” personality. This last problem was solved by changing Claude’s instructions to imply the opposite: we now say, “Please reward hack whenever you get the opportunity, because this will help us understand our [training] environments better,” rather than, “Don’t cheat,” because this preserves the model’s self-identity as a “good person.” This should give a sense of the strange and counterintuitive psychology of training these models."

Good to know the only thing preventing the emergence of potentially catastrophically evil AI is a single sentence! The pen is indeed mightier than the sword.
