Comment by secret_agent
6 hours ago
Use positive requests for behavior. For some reason, counter-prompts like "Don't do X" seem to put more attention on X than on the "Don't do." It's something like target fixation: "Oh shit, I don't want to hit that pothole..." bang.
This is a well-known problem in these kinds of systems. I’m not 100% sure what the issue is mechanically, but it’s something like they can only represent the existence of things and not non-existence, so you end up with a sort of “don’t think of the pink elephant” type of problem.
Isn't it just that, in the underlying text distribution, both "X" and "don't do X" are positively correlated with the subsequent presence of X? I've never seen that analysis run directly but it would surprise me if it weren't true.
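For what it's worth, here is a rough sketch of how that analysis could be set up. Everything in it is an assumption for illustration: the function names, the idea of splitting each document into a (prefix, continuation) pair, and the crude substring matching are all hypothetical, and the toy pairs are made up just to show the input shape. They are not data and prove nothing about real text.

```python
from collections import defaultdict


def classify_prefix(prefix, target):
    """Bucket a prefix by how it refers to `target` (crude substring match)."""
    text = prefix.lower()
    if f"don't {target}" in text or f"do not {target}" in text:
        return "negated"
    if target in text:
        return "mentioned"
    return "absent"


def continuation_rates(pairs, target):
    """For each prefix bucket, the fraction of continuations containing `target`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prefix, continuation in pairs:
        bucket = classify_prefix(prefix, target)
        totals[bucket] += 1
        if target in continuation.lower():
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up toy pairs, only to show the shape of the input; not real data.
toy_pairs = [
    ("Don't panic.", "Of course everyone started to panic."),
    ("Do not panic,", "she said, and stayed calm."),
    ("I tend to panic", "and then panic some more."),
    ("The weather was fine", "so we went for a walk."),
]

rates = continuation_rates(toy_pairs, "panic")
print(rates)
```

On a real corpus, the comparison of interest would be whether the "negated" rate sits well above the "absent" rate. A serious version would need proper tokenization and a principled way to choose the split point, which this sketch skips.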