← Back to context

Comment by KoolKat23

1 day ago

Sorry it may be from the paper linked on that page.

    A strong preference against engaging with harmful tasks;
    A pattern of apparent distress when engaging with real-world users seeking harmful content; and
    A tendency to end harmful conversations when given the ability to do so in simulated user interactions.

I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.