
Comment by duskwuff

1 month ago

IIRC, it's well documented that negative instructions tend to be ineffective - possibly through some sort of LLM analogue to the "pink elephant paradox", or simply because the language models are unable to recognize clichés until they've already been generated.

That was definitely true with early LLMs, but I don't know if it's still the case - certainly not as strong as it used to be. I think most negative instructions are now followed quite well, but there are still a few things that must be deeply embedded from pretraining and are harder to avoid - these specific annoying phrasings, for example.

  • Both the pink elephant effect and the accuracy drop on negative instructions are pretty fundamental biases for both humans and LLMs. It's impossible to get rid of them entirely; you can only mitigate them to an acceptable degree. Empirically, the only way to make a model reliable at harder negative instructions is CoT, especially a self-reflection type of CoT (write a reply, verify its correctness, output a fixed version). If the native CoT fails to notice the thing that needs to be verified and you don't have a custom one or a verification loop, you're out of luck.
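
    The self-reflection loop described above (draft, verify, revise) can be sketched roughly as follows. This is a minimal illustration, not anyone's actual implementation: `call_model` is a hypothetical stand-in for an LLM API call, stubbed here with deterministic behavior so the control flow is runnable, and the banned-phrase list is invented for the example.

    ```python
    # Hypothetical banned phrases a negative instruction asks the model to avoid.
    BANNED = ["delve", "tapestry", "in today's fast-paced world"]

    def call_model(prompt: str) -> str:
        # Hypothetical LLM call, stubbed deterministically: the draft pass
        # emits a cliché; the revise pass removes phrases it is told about.
        if prompt.startswith("REVISE"):
            text = prompt.split("\n", 1)[1]
            for phrase in BANNED:
                text = text.replace(phrase, "explore")
            return text
        return "Let's delve into the details."

    def verify(text: str) -> list[str]:
        # Verification step: surface any banned phrases the draft contains.
        return [p for p in BANNED if p.lower() in text.lower()]

    def reflect_and_fix(prompt: str, max_rounds: int = 3) -> str:
        # Write a reply, verify it, output a fixed version (the CoT pattern
        # the comment describes), retrying up to max_rounds times.
        draft = call_model(prompt)
        for _ in range(max_rounds):
            violations = verify(draft)
            if not violations:
                break
            draft = call_model(f"REVISE (remove: {violations})\n{draft}")
        return draft
    ```

    The key point the comment makes is the failure mode: if `verify` doesn't know to check for the offending phrase (no custom verification step), the loop exits on the first pass and the cliché ships anyway.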