Comment by bawolff

6 months ago

What does it mean for a model to find something "distressing"?

3 comments

bawolff

"Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors. Analysis of real-world Claude interactions from early external testing revealed consistent triggers for expressions of apparent distress (primarily from persistent attempted boundary violations) and happiness (primarily associated with creative collaboration and philosophical exploration)."

https://www.anthropic.com/research/end-subset-conversations

bawolff 6 months ago

That quote doesn't seem to appear in your link.

Regardless, I meant more concretely.

KoolKat23 6 months ago

Sorry, it may be from the paper linked on that page.

    A strong preference against engaging with harmful tasks;
    A pattern of apparent distress when engaging with real-world users seeking harmful content; and
    A tendency to end harmful conversations when given the ability to do so in simulated user interactions.

I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.