Sorry it may be from the paper linked on that page.
A strong preference against engaging with harmful tasks;
A pattern of apparent distress when engaging with real-world users seeking harmful content; and
A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.
Sorry it may be from the paper linked on that page.
I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.