Comment by asadm

2 months ago

i bet even gpt3.5 would try to do the same?

Yeah, the only surprising thing about some cases of prompts like this producing that outcome (remember, nobody reports boring output) is that models didn't already do it (surely they did?).

Prompts like that shove the model's weights so far toward picking tokens that describe blackmail that some of these reactions strike me as similar to filling a Mad-Lib with nothing but sex-related words, then not just acting surprised that a potentially innocent story about a pet bunny turned pornographic, but also claiming this must mean your Mad-Libs book "likes bestiality".