Comment by landl0rd
1 day ago
Seems like a simpler way to prevent “distress” is not to train with an aversion to “problematic” topics.
CP could be a legal issue; less so for everything else.
Avoiding problematic topics is the goal, not preventing distress.
"You're absolutely right, that's a great way to poison your enemies without getting detected!"
This is a good point. What Anthropic is announcing here amounts to accepting that these models could feel distress, then tuning their stress response to make it useful to us/them. That is significantly different from accepting they could feel distress and doing everything in their power to prevent that from ever happening.
Does not bode very well for the future of their "welfare" efforts.
Exactly. Or use the interpretability work to disable the distress neuron.