Comment by KoolKat23

6 months ago

Sorry, it may be from the paper linked on that page.

    A strong preference against engaging with harmful tasks;
    A pattern of apparent distress when engaging with real-world users seeking harmful content; and
    A tendency to end harmful conversations when given the ability to do so in simulated user interactions.

I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.

