Comment by fc417fc802

1 month ago

Responding mostly to your linked comment. I think (educated guess) that there are two primary factors: how much the history comes up in the raw training data, and the censorship process itself. The latter increases how often the topic comes up during training, which serves to strengthen the association.

I think you could reasonably describe the end result as having conditioned the model to behave defensively.