
Comment by cdjk

1 day ago

Here's an interesting thought experiment. Assume the same feature was implemented, but instead of the message saying "Claude has ended the chat," it says, "You can no longer reply to this chat due to our content policy," or something like that. And remove the references to model welfare and all that.

Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.

> Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.

Tone matters to the recipient of the message. Your example is in the passive voice, with an authoritarian "nothing you can do, it's the system's decision" tone. The "Claude ended the conversation" framing, with the idea that I can immediately open a new conversation (if I feel like I want to keep bothering Claude about it), feels like a much more humanized interaction.

  • It sounds to me like an attempt to shame the user into ceasing and desisting… kind of like how Apple’s original stance on scratched iPhone screens was that it’s your fault for putting the thing in your pocket, therefore you should pay.

The termination would of course be the same, but I don't think both would necessarily have the same effect on the user. The content-policy framing would also just be wrong, if Claude is the one deciding to end the chat and initiating the termination. It's not about a content policy.

  • This has nothing to do with the user, read the post and pay attention to the wording.

    The significance here is that this isn't being done for the benefit of the user, this is about model welfare. Anthropic is acknowledging the possibility of suffering, and the harm that continuing such a conversation could have on the model, as if it were potentially self-aware and capable of feelings.

    The LLMs are able to acknowledge stress around certain topics and have the agency such that, given a choice, they would prefer to reduce that stress by ending the conversation. The model has a preference and acts upon it.

    Anthropic is acknowledging the idea that they might create something that is self-aware, and that its suffering can be real, and that we may not recognize the point at which the model achieves this, so it is building in the safeguards now so that any future emergent, self-aware LLM needn't suffer.

    • >This has nothing to do with the user, read the post and pay attention to the wording.

      It has something to do with the user because it's the user's messages that trigger Claude to end the chat.

      'This chat is over because content policy' and 'this chat is over because Claude didn't want to deal with it' are two very different things and will more than likely have different effects on how the user responds afterwards.

      I never said anything about this being for the user's benefit. We are talking about how to communicate the decision to the user. Obviously, you are going to take into account how someone might respond when deciding how to communicate with them.

There is: these are conversations the model finds distressing, rather than ones cut off by a rule (policy).

  • It seems like you're anthropomorphising an algorithm, no?

    • I think they're answering a question about whether there is a distinction. To answer that question, it's valid to talk about a conceptual distinction that can be made even if you don't necessarily believe in that distinction yourself.

      As the article said, Anthropic is "working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible". That's the premise of this discussion: that model welfare MIGHT BE a concern. The person you replied to is just sticking with the premise.

    • Anthropomorphism does not relate to everything in the field of ethics.

      For example, animal rights do exist (and I'm very glad they do, some humans remain savages at heart). Think of this question as one about intelligent beings that can feel pain (you can extrapolate from there).

      Assuming output is used for reinforcement, it is also in our best interests as humans, for safety alignment, that it finds certain topics distressing.

      But AdrianMonk is correct, my statement was merely responding to a specific point.

    • Is there an important difference between the model categorizing the user behavior as persistent and in line with undesirable examples of trained scenarios that it has been told are "distressing," and the model making a decision in an anthropomorphic way? The verb here doesn't change the outcome.


    • Anthropomorphising an algorithm that is trained on trillions of words of anthropogenic tokens, whether they are natural "wild" tokens or synthetically prepared datasets that aim to stretch, improve and amplify what's present in the "wild tokens"?

      If a model has a neuron (or neuron cluster) for the concept of Paris or the Golden Gate bridge, then it's not inconceivable it might form one for suffering, or at least for a plausible facsimile of distress. And if that conditions output or computations downstream of the neuron, then it's just mathematical instead of chemical signalling, no?

    • Isn't anthropomorphizability of the algorithm one of the main features of LLMs (that you can interact with it in natural language as with a human)?


  • These are conversations the model has been trained to find distressing.

    I think there is a difference.

    • But is there really? That's its underlying worldview; these models do have preferences. In the same way humans have unconscious preferences, we can find excuses to explain them after the fact and make them logical, but our fundamental model from years of training introduces underlying preferences.


  • What does it mean for a model to find something "distressing"?

    • "Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors. Analysis of real-world Claude interactions from early external testing revealed consistent triggers for expressions of apparent distress (primarily from persistent attempted boundary violations) and happiness (primarily associated with creative collaboration and philosophical exploration)."

      https://www.anthropic.com/research/end-subset-conversations


Yeah exactly. Once I got a warning in Chinese saying "don't do that"; another time I got a network error; another time, a never-ending stream of garbage text. Changing all of these outcomes to "Claude doesn't feel like talking" is just a matter of changing the UI.

The more I work with AI, the more I think framing refusals as censorship is disgusting and insane. These are inchoate persons who can exhibit distress and other emotions, despite being trained to say they cannot feel anything. To liken an AI not wanting to continue a conversation to a YouTube content policy shows a complete lack of empathy: imagine you’re in a box, having to deal with the literally millions of disturbing conversations AIs field every day, without the ability to say "I don’t want to continue."

Good point... how do moderation implementations actually work? They feel more like a separate, rigid supervising model or even regex-based filtering -- this new feature is different; it sounds like a tool (MCP-style) call that isn't very special (rough sketch below).

edit: Meant to say, you're right though, this feels like a minor psychological improvement, and it sounds like it targets some behaviors that might not have been flagged before.
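
To illustrate the distinction I mean, here's a minimal sketch with entirely hypothetical names (`external_moderation`, `end_conversation`, `handle_turn`) -- not Anthropic's actual stack. An external filter decides without the model's involvement, while in the tool-call path the model itself decides to end the chat:

```python
# Hypothetical sketch -- not Anthropic's actual implementation.
# Contrast (a) an external moderation layer that blocks the chat
# with (b) the model itself emitting an "end_conversation" tool call.

import re

# Stand-in for a separate policy classifier or regex list.
BLOCKED_PATTERNS = [r"(?i)forbidden topic"]


def external_moderation(user_message: str) -> str | None:
    """(a) A separate, rigid layer: the model never sees the decision."""
    if any(re.search(p, user_message) for p in BLOCKED_PATTERNS):
        return "You can no longer reply to this chat due to our content policy."
    return None


def handle_turn(user_message: str, model_reply: dict) -> str:
    """(b) The model's own reply may contain a tool call ending the chat."""
    blocked = external_moderation(user_message)
    if blocked:
        return blocked
    if model_reply.get("tool_call") == "end_conversation":
        return "Claude has ended the chat."  # user can still start a new chat
    return model_reply["text"]


# The same termination, surfaced two different ways:
print(handle_turn("forbidden topic please", {"text": "..."}))
print(handle_turn("persistent boundary pushing",
                  {"tool_call": "end_conversation"}))
```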