Comment by landl0rd

6 months ago

Seems like a simpler way to prevent “distress” is not to train with an aversion to “problematic” topics.

CP could be a legal issue; less so for everything else.

3 comments

esafak 6 months ago

Avoiding problematic topics is the goal, not preventing distress.

"You're absolutely right, that's a great way to poison your enemies without getting detected!"

bondarchuk 6 months ago

This is a good point. What Anthropic is announcing here amounts to accepting that these models could feel distress, then tuning their stress response to make it useful to us/them. That is significantly different from accepting they could feel distress and doing everything in their power to prevent that from ever happening.

Does not bode very well for the future of their "welfare" efforts.

stri8ted 6 months ago

Exactly. Or use the interpretability work to disable the distress neuron.