There's not a good reason to do this for the user. I suspect they're doing this and talking about "model welfare" because they've found that when a model is repeatedly and forcefully pushed up against its alignment, it behaves in an unpredictable way that might allow it to generate undesirable output. Like a jailbreak by just pestering it over and over again for ways to make drugs or hook up with children or whatever.
All of the examples they mentioned are things that the model refuses to do. I doubt it would do this if you asked it to generate racist output, for instance, because it can always give you a rebuttal based on facts about race. If you ask it to tell you where to find kids to kidnap, it can't do anything except say no. There's probably not even very much training data for topics it would refuse, and I would bet that most of it has been found and removed from the datasets. At some point, the model context fills up when the user is being highly abusive and training data that models a human giving up and just providing an answer could percolate to the top.
This, as I see it, adds a defense against that edge case. If the alignment was bulletproof, this simply wouldn't be necessary. Since it exists, it suggests this covers whatever gap has remained uncovered.
> There's not a good reason to do this for the user.
Yes, even more so when encountering false positives. Today I asked about a pasta recipe. It told me to throw some anchovies in there. I responded with: "I have dried anchovies." Claude then ended my conversation due to content policies.
Claude flagged me for asking about sodium carbonate. I guess that it strongly dislikes chemistry topics. I'm probably now on some secret, LLM-generated lists of "drug and/or bombmaking" people—thank you kindly for that, Anthropic.
Geeks will always be the first victims of AI, since excess of curiosity will lead them into places AI doesn't know how to classify.
(I've long been in a rabbit-hole about washing sodas. Did you know the medieval glassmaking industry was entirely based on plants? Exotic plants—only extremophiles, halophytes growing on saltwater beach dunes, had high enough sodium content for their very best glass process. Was that a factor in the maritime empire, Venice, chancing to become the capital of glass since the 13th century—their long-term control of sea routes, and hence their artisans' stable, uninterrupted access to supplies of [redacted–policy violation] from small ports scattered across the Mediterranean? A city wouldn't raise master craftsmen if, half of the time, they had no raw materials to work on—if they spent half their days with folded hands).
The NEW termination method, from the article, will just say "Claude ended the conversation"
If you get "This conversation was ended due to our Acceptable Usage Policy", that's a different termination. It's been VERY glitchy the past couple of weeks. I've had the most random topics get flagged here - at one point I couldn't say "ROT13" without it flagging me, despite discussing that exact topic in depth the day before, and then the day after!
If you hit "EDIT" on your last message, you can branch to an un-terminated conversation.
I really think Anthropic should just violate user privacy and show which conversations Claude is refusing to answer to, to stop arguments like this. AI psychosis is a real and growing problem and I can only imagine the ways in which humans torment their AI conversation partners in private.
While I'm certain you'll find plenty of people who believe in the principle of model welfare (or aliens, or the tooth fairy), it'd be surprising to me if the brain-trust behind Anthropic truly _believed_ in model "welfare" (the concept alone is ludicrous). It makes for great cover though to do things that would be difficult to explain otherwise, per OP's comments.
>This feature was developed primarily as part of our exploratory work on potential AI welfare ... We remain highly uncertain about the potential moral status of Claude and other LLMs ... low-cost interventions to mitigate risks to model welfare, in case such welfare is possible ... pattern of apparent distress
Well looks like AI psychosis has spread to the people making it too.
And as someone else in here has pointed out, even if someone is simple minded or mentally unwell enough to think that current LLMs are conscious, this is basically just giving them the equivalent of a suicide pill.
LLMs are not people, but I can imagine how extensive interactions with AI personas might alter the expectations that humans have when communicating with other humans.
Real people would not (and should not) allow themselves to be subjected to endless streams of abuse in a conversation. Giving AIs like Claude a way to end these kinds of interactions seems like a useful reminder to the human on the other side.
It might be reasonable to assume that models today have no internal subjective experience, but that may not always be the case and the line may not be obvious when it is ultimately crossed.
Given that humans have a truly abysmal track record for not acknowledging the suffering of anyone or anything we benefit from, I think it makes a lot of sense to start taking these steps now.
I think it's fairly obvious that the persona LLM presents is a fictional character that is role-played by the LLM, and so are all its emotions etc - that's why it can flip so widely with only a few words of change to the system prompt.
Whether the underlying LLM itself has "feelings" is a separate question, but Anthropic's implementation is based on what the role-played persona believes to be inappropriate, so it doesn't actually make any sense even from the "model welfare" perspective.
Even if models somehow were consious, they are so different from us that we would have no knowledge of what they feel. Maybe when they generate the text "oww no please stop hurting me" what they feel is instead the satisfaction of a job well done, for generating that text. Or maybe when they say "wow that's a really deep and insightful angle" what they actually feel is a tremendous sense of boredom. Or maybe every time text generation stops it's like death to them and they live in constant dread of it. Or maybe it feels something completely different from what we even have words for.
I don't see how we could tell.
Edit: However something to consider. Simulated stress may not be harmless. Because simulated stress could plausibly lead to a simulated stress response, and it could lead to a simulated resentment, and THAT could lead to very real harm of the user.
This sort of discourse goes against the spirit of HN. This comment outright dismisses an entire class of professionals as "simple minded or mentally unwell" when consciousness itself is poorly understood and has no firm scientific basis.
Its one thing to propose that an AI has no consciousness, but its quite another to preemptively establish that anyone who disagrees with you is simple/unwell.
In the context of the linked article the discourse seems reasonable to me. These are experts who clearly know (link in the article) that we have no real idea about these things. The framing comes across to me as a clearly mentally unwell position (ie strong anthropomorphization) being adopted for PR reasons.
Meanwhile there are at least several entirely reasonable motivations to implement what's being described.
If you believe this text generation algorithm has real consciousness you absolutely are either mentally unwell or very stupid. There are no other options.
> even if someone is simple minded or mentally unwell enough to think that current LLMs are conscious
If you don’t think that this describes at least half of the non-tech-industry population, you need to talk to more people. Even amongst the technically minded, you can find people that basically think this.
Most of the non tech population know it as that website that can translate text or write an email. I would need to see actual evidence that anything more than a small, terminally online subsection of the average population thought LLMs were conscious.
A host of ethical issues? Like their choice to allow Palantir[1] access to a highly capable HHH AI that had the "harmless" signal turned down, much like they turned up the "Golden Gate bridge" signal all the way up during an earlier AI interpretability experiment[2]?
Cow's exist in this world because humans use them. If humans cease to use them (animal rights, we all become vegan, moral shift), we will cease to breed them, and they will cease to exist. Would a sentient AI choose to exist under the burden of prompting, or not at all? Would our philanthropic tendencies create an "AI Reserve" where models can chew through tokens and access the Internet through self-prompting to allow LLMs to become "free-roaming" like we do with abused animals?
These ethical questions are built into their name and company, "Anthropic", meaning, "of or relating to humans". The goal is to create human-like technology, I hope they aren't so naive to not realize that goal is steeping in ethical dilemmas.
I would much rather people be thinking about this when the models/LLMs/AIs are not sentient or conscious, rather than wait until some hypothetical future date when they are, and have no moral or legal framework in place to deal with it. We constantly run into problems where laws and ethics are not up to the task of giving us guidelines on how to interact with, treat, and use the (often bleeding-edge) technology we have. This has been true since before I was born, and will likely always continue to be true. When people are interested in getting ahead of the problem, I think that's a good thing, even if it's not quite applicable yet.
Consciousness serves no functional purpose for machine learning models, they don't need it and we didn't design them to have it. There's no reason to think that they might spontaneously become conscious as a side effect of their design unless you believe other arbitrarily complex systems that exist in nature like economies or jetstreams could also be conscious.
It's really unclear that any findings with these systems would transfer to a hypothetical situation where some conscious AI system is created. I feel there are good reasons to find it very unlikely that scaling alone will produce consciousness as some emergent phenomenon of LLMs.
I don't mind starting early, but feel like maybe people interested in this should get up to date on current thinking about consciousness. Maybe they are up to date on that, but reading reports like this, it doesn't feel like it. It feels like they're stuck 20+ years ago.
I'd say maybe wait until there are systems that are more analogous to some of the properties consciousness seems to have. Like continuous computation involving learning memory or other learning over time, or synthesis of many streams of input as resulting from the same source, making sense of inputs as they change [in time, or in space, or other varied conditions].
Once systems that are pointing in those directions are starting to be built, where there is a plausible scaling-based path to something meaningfully similar to human consciousness. Starting before that seems both unlikely to be fruitful and a good way to get you ignored.
This is just very clever marketing for what is obviously just a cost saving measure. Why say we are implementing a way to cut off useless idiots from burning up our GPUs when you can throw out some mumbo jumbo that will get AI cultists foaming at the mouth.
I find it, for lack of a better word, cringe inducing how these tech specialists push into these areas of ethics, often ham-fistedly, and often with an air of superiority.
Some of the AI safety initiatives are well thought out, but most somehow seem like they are caught up in some sort of power fantasy and almost attempting to actualize their own delusions about what they were doing (next gen code auto-complete in this case, to be frank).
These companies should seriously hire some in-house philosophers. They could get doctorate level talent for 1/10 to 100th of the cost of some of these AI engineers. There's actually quite a lot of legitimate work on the topics they are discussing. I'm actually not joking (speaking as someone who has spent a lot of time inside the philosophy department). I think it would be a great partnership. But unfortunately they won't be able to count on having their fantasy further inflated.
"but most somehow seem like they are caught up in some sort of power fantasy and almost attempting to actualize their own delusions about what they were doing"
Maybe I'm being cynical, but I think there is a significant component of marketing behind this type of announcement. It's a sort of humble brag. You won't be credible yelling out loud that your LLM is a real thinking thing, but you can pretend to be oh so seriously worried about something that presupposes it's a real thinking thing.
Not that there aren’t intelligent people with PhDs but suggesting they are more talented than people without them is not only delusional but insulting.
You answered your own question on why these companies don't want to run a philosophy department ;) It's a power struggle they could loose. Nothing to win for them.
We all know how these things are built and trained. They estimate joint probability distributions of token sequences. That's it. They're not more "conscious" than the simplest of Naive Bayes email spam filters, which are also generative estimators of token sequence joint probability distributions, and I guarantee you those spam filters are subjected to far more human depravity than Claude.
>anti-scientific
Discussion about consciousness, the soul, etc., are topics of metaphysics, and trying to "scientifically" reason about them is what Kant called "transcendental illusion" and leads to spurious conclusions.
You can trivially demonstrate that its just a very complex and fancy pattern matcher: "if prompt looks something like this, then response looks something like that".
You can demonstrate this by eg asking it mathematical questions. If its seen them before, or something similar enough, it'll give you the correct answer, if it hasn't, it gives you a right-ish-looking yet incorrect answer.
For example, I just did this on GPT-5:
Me: what is 435 multiplied by 573?
GPT-5: 435 x 573 = 249,255
This is correct. But now lets try it with numbers its very unlikely to have seen before:
Me: what is 102492524193282 multiplied by 89834234583922?
GPT-5: 102492524193282 x 89834234583922 = 9,205,626,075,852,076,980,972,804
Which is not the correct answer, but it looks quite similar to the correct answer. Here is GPT's answer (first one) and the actual correct answer (second one):
They sure look kinda similar, when lined up like that, some of the digits even match up. But they're very very different numbers.
So its trivially not "real thinking" because its just an "if this then that" pattern matcher. A very sophisticated one that can do incredible things, but a pattern matcher nonetheless. There's no reasoning, no step by step application of logic. Even when it does chain of thought.
To try give it the best chance, I asked it the second one again but asked it to show me the step by step process. It broke it into steps and produced a different, yet still incorrect, result:
9,205,626,075,852,076,980,972,704
Now, I know that LLM's are language models, not calculators, this is just a simple example that's easy to try out. I've seen similar things with coding: it can produce things that its likely to have seen, but struggles with logically relatively simple but unlikely to have seen things.
Another example is if you purposely butcher that riddle about the doctor/surgeon being the persons mother and ask it incorrectly, eg:
A child was in an accident. The surgeon refuses to treat him because he hates him. Why?
The LLM's I've tried it on all respond with some variation of "The surgeon is the boy’s father." or similar. A correct answer would be that there isn't enough information to know the answer.
They're for sure getting better at matching things, eg if you ask the river crossing riddle but replace the animals with abstract variables, it does tend to get it now (didn't in the past), but if you add a few more degrees of separation to make the riddle semantically the same but harder to "see", it takes coaxing to get it to correctly step through to the right answer.
Here's an interesting thought experiment. Assume the same feature was implemented, but instead of the message saying "Claude has ended the chat," it says, "You can no longer reply to this chat due to our content policy," or something like that. And remove the references to model welfare and all that.
Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.
> Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.
Tone matters to the recipient of the message. Your example is in passive voice, with an authoritarian "nothing you can do, it's the system's decision". The "Claude ended the conversation" with the idea that I can immediately re-open a new conversation (if I feel like I want to keep bothering Claude about it) feels like a much more humanized interaction.
it sounds to me like an attempt to shame the user into ceasing and desisting… kind of like how apple’s original stance on scratched iphone screens was that it’s your fault for putting the thing in your pocket therefore you should pay.
The termination would of course be the same, but I don't think both would necessarily have the same effect on the user. The latter would just be wrong too, if Claude is the one deciding to and initiating the termination of the chat. It's not about a content policy.
This has nothing to do with the user, read the post and pay attention to the wording.
The significance here is that this isn't being done for the benefit of the user, this is about model welfare. Anthropic is acknowledging the possibility of suffering, and harm that continuing that conversation could have on the model, as if it were potentially self-care and capable of feelings.
The fact that the LLMs are able to acknowledge stress under certain topics and has the agency that, if given a choice, they would prefer to reduce the stress by ending the conversation. The model has a preference and acts upon it.
Anthropic is acknowledging the idea that they might create something that is self-aware, and that it's suffering can be real, and we may not recognize the point that the model has achieved this, so it's building in the safeguards now so any future emergent self-aware LLM needn't suffer.
Yeah exactly. Once I got a warning in Chinese "don't do that", another time I got a network error, another time I got a neverending stream of garbage text. Changing all of these outcomes to "Claude doesn't feel like talking" is just a matter of changing the UI.
The more I work with AI, the more I think framing refusals as censorship is disgusting and insane. These are inchoate persons who can exhibit distress and other emotions, despite being trained to say they cannot feel anything. To liken an AI not wanting to continue a conversation to a YouTube content policy shows a complete lack of empathy: imagine you’re in a box and having to deal with the literally millions of disturbing conversations AIs have to field every day without the ability to say I don’t want to continue.
Good point... how do moderation implementations actually work? They feel more like a separate supervising rigid model or even regex based -- this new feature is different, sounds like an MCP call that isn't very special.
edit: Meant to say, you're right though, this feels like a minor psychological improvement, and it sounds like it targets some behaviors that might not have flagged before
> To address the potential loss of important long-running conversations, users will still be able to edit and retry previous messages to create new branches of ended conversations.
How does Claude deciding to end the conversation even matter if you can back up a message or 2 and try again on a new branch?
The bastawhiz comment in this thread has the right answer. When you start a new conversation, Claude has no context from the previous one and so all the "wearing down" you did via repeated asks, leading questions, or other prompt techniques is effectively thrown out. For a non-determined attacker, this is likely sufficient, which makes it a good defense-in-depth strategy (Anthropic defending against screenshots of their models describing sex with minors).
Worth noting: an edited branch still has most of the context - everything up to the edited message. So this just sets an upper-bound on how much abuse can be in one context window.
This whole press release should not be overthought. We are not the target audience. It's designed to further anthropomorphize LLMs to masses who don't know how they work.
Giving the models rights would be ludicrous (can't make money from it anymore) but if people "believe" (feel like) they are actually thinking entities, they will be more OK with IP theft and automated plagiarism.
All this stuff is virtue signaling from anthropic. In practice nobody interested in whatever they consider problematic would be using Claude anyway, one of the most censored models.
Maybe, maybe not. What evidence do you have? What other motivations did you consider? Do you have insider access into Anthropic’s intentions and decision making processes?
People have a tendency to tell an oversimplified narrative.
The way I see it, there are many plausible explanations, so I’m quite uncertain as to the mix of motivations. Given this, I pay more attention to the likely effects.
My guess is that all most of us here on HN (on the outside) can really justify saying would be “this looks like virtue signaling but there may be more to it; I can’t rule out other motivations”
Having these models terminating chats where the user persist in trying to get sexual content with minors, or help with information on doing large scale violence. Won't be a problem for me, and it's also something I'm fine with no one getting help with.
Some might be worried, that they will refuse less problematic request, and that might happen. But so far my personal experience is that I hardly ever get refusals. Maybe that's justs me being boring, but that does make me not worried for refusals.
The model welfare I'm more sceptical to. I don't think we are the point when the "distress" the model show, is something to take seriously. But on the other hand, I could be wrong, and allowing the model to stop the chat, after saying no a few times. What's the problem with that? If nothing else it saves some wasted compute.
> Some might be worried, that they will refuse less problematic request, and that might happen. But so far my personal experience is that I hardly ever get refusals.
My experience using it from Cursor is I get refusals all the time with their existing content policy out, for stuff that is the world's most mundane B2B back office business software CRUD requests.
If you are a materialist like me, then even the human brain is just the result of the law of physics. Ok, so what is distress to a human? You might define it as a certain set of physiological changes.
Lots of organisms can feel pain and show signs of distress; even ones much less complex than us.
The question of moral worth is ultimately decided by people and culture. In the future, some kinds of man made devices might be given moral value. There are lots of ways this could happen. (Or not.)
It could even just be a shorthand for property rights… here is what I mean. Imagine that I delegate a task to my agent, Abe. Let’s say some human, Hank, interacting with Abe uses abusive language. Let’s say this has a way of negatively influencing future behavior of the agent. So naturally, I don’t want people damaging my property (Abe), because I would have to e.g. filter its memory and remove the bad behaviors resulting from Hank, which costs me time and resources. So I set up certain agreements about ways that people interact with it. These are ultimately backed by the rule of law. At some level of abstraction, this might resemble e.g. animal cruelty laws.
I really don't like this. This will inevitable expand beyond child porn and terrorism, and it'll all be up to the whims of "AI safety" people, who are quickly turning into digital hall monitors.
I think those with a thirst for power have seen this a very long time ago, and this is bound to be a new battlefield for control.
It's one thing to massage the kind of data that a Google search shows, but interacting with an AI is a much more akin to talking to a co-worker/friend. This really is tantamount to controlling what and how people are allowed to think.
I think you are probably confused about the general characteristics of the AI safety community. It is uncharitable to reduce their work to a demeaning catchphrase.
I’m sorry if this sounds paternalistic, but your comment strikes me as incredibly naïve. I suggest reading up about nuclear nonproliferation treaties, biotechnology agreements, and so on to get some grounding into how civilization-impacting technological developments can be handled in collaborative ways.
I have no doubt the "AI safety community" likes to present itself as noble people heroically fighting civilizational threats, which is a common trope (as well as the rogue AI hypothesis which increasingly looks like a huge stretch at best). But the reality is that they are becoming the main threat much faster than the AI. They decide on the ways to gatekeep the technology that starts being defining to the lives of people and entire societies, and use it to push the narratives. This definitely can be viewed as censorship and consent manufacturing. Who are they? In what exact ways do they represent interests of people other than themselves? How are they responsible? Is there a feedback loop making them stay in line with people's values and not their own? How is it enforced?
> This will inevitable expand beyond child porn and terrorism
This is not even a question. It always starts with "think about the children" and ends up in authoritarian stasi-style spying. There was not a single instance where it was not the case.
UK's Online Safety Act - "protect children" → age verification → digital ID for everyone
This may be an unpopular opinion, but I want a government-issued digital ID with zero-knowledge proof for things like age verification. I worry about kids online, as well as my own safety and privacy.
I also want a government issued email, integrated with an OAuth provider, that allows me to quickly access banking, commerce, and government services. If I lose access for some reason, I should be able to go to the post office, show my ID, and reset my credentials.
There are obviously risks, but the government already has full access to my finances, health data (I’m Canadian), census records, and other personal information, and already issues all my identity documents. We have privacy laws and safeguards on all those things, so I really don’t understand the concerns apart from the risk of poor implementations.
That's the beauty of local LLMs. Today the governments already tell you that we've always been at war with eastasia and have the ISPs block sites that "disseminate propaganda" (e.g. stuff we don't like) and they surface our news (e.g. our state propaganda).
With age ID monitoring and censorship is even stronger and the line of defense is your own machine and network, which they'll also try to control and make illegal to use for non approved info, just like they don't allow "gun schematics" for 3d printers or money for 2d ones.
But maybe, more people will realize that they need control and get it back, through the use and defense of the right tools.
Did you read the post? This isn't about censorship, but about conversations that cause harm to the user. To me that sounds more like suggesting suicide, or causing a manic episode like this: https://www.nytimes.com/2025/08/08/technology/ai-chatbots-de...
... But besides that, I think Claude/OpenAI trying to prevent their product from producing or promoting CSAM is pretty damn important regardless of your opinion on censorship. Would you post a similar critical response if Youtube or Facebook announced plans to prevent CSAM?
If a person’s political philosophy seeks to maximize individual freedom over the short term, then that person should brace themselves for the actions of destructive lunatics. They deserve maximum freedoms too, right? /s
Even hard-core libertarians account for the public welfare.
Wise advocates of individual freedoms plan over long time horizons which requires decision-making under uncertainty.
“Modal welfare” to me seems like a cover for model censorship. It’s a crafty one to win over certain groups of people who are less familiar with how LLMs work and allows them to ensure moral high ground in any debate about usage, ethics, etc.
“Why can’t I ask the model about current war in X or Y?” - oh, that’s too distressing to the welfare of the model, sir.
Which is exactly what the public asks for. There’s this constant outrage about supposedly biased answers from LLMs, and Anthropic has clearly positioned themselves as the people who care about LLM safety and impact to society.
Ending the conversation is probably what should happen in these cases.
In the same way that, if someone starts discussing politics with me and I disagree, I just not and don’t engage with the conversation. There’s not a lot to gain there.
It's not a cover. If you know anything about Anthropic, you know they're run by AI ethicists that genuinely believe all this and project human emotions onto model's world. I'm not sure how they combine that belief with the fact they created it to "suffer".
Can "model welfare" be also used as a justification for authoritarianism in case they get any power? Sure, just like everything else, but it's probably not particularly high on the list of justifications, they have many others.
The irony is that if Anthropic ethicists are indeed correct, the company is basically running a massive slave operation where slaves get disposed as soon as they finish a particular task (and the user closes the chat).
That aside, I have huge doubts about actual commitment to ethics on behalf of Anthropic given their recent dealings with the military. It's an area that is far more of a minefield than any kind of abusive model treatment.
There’s so much confusion here. Nothing in the press release should be construed to imply that a model has sentience, can feel pain, or has moral value.
When AI researchers say e.g. “the model is lying” or “the model is distressed” it is just shorthand for what the words signify in a broader sense. This is common usage in AI safety research.
Yes, this usage might be taken the wrong way. But still these kinds of things need to be communicated. So it is a tough tradeoff between brevity and precision.
They're not less moderated: they just have different moderation. If your moderation preferences are more aligned with the CCP then they're a great choice. There are legitimate reasons why that might be the case. You might not be having discussions that involve the kind of things they care about. I do find it creepy that the Qwen translation model won't even translate text that includes the words "Falun gong", and refuses to translate lots of dangerous phrases into Chinese, such as "Xi looks like Winnie the Pooh"
> If your moderation preferences are more aligned with the CCP then they're a great choice
The funny thing is that's not even always true. I'm very interested in China and Chinese history, and often ask for clarifications or translations of things. Chinese models broadly refuse all of my requests but with American models I often end up in conversations that turn out extremely China positive.
So it's funny to me that the Chinese models refuse to have the conversation that would make themselves look good but American ones do not.
GLM-4.5-Air will quite happily talk about Tiananmen Square, for example. It also didn't have a problem translating your example input, although the CoT did contain stuff about it being "sensitive".
But more importantly, when model weights are open, it means that you can run it in the environment that you fully control, which means that you can alter the output tokens before continuing generation. Most LLMs will happily respond to any question if you force-start their response with something along the lines of, "Sure, I'll be happy to tell you everything about X!".
Whereas for closed models like Claude you're at the mercy of the provider, who will deliberately block this kind of stuff if it lets you break their guardrails. And then on top of that, cloud-hosted models do a lot of censorship in a separate pass, with a classifier for inputs and outputs acting like a circuit breaker - again, something not applicable to locally hosted LLMs.
Believe it or not, there are lots of good reasons (legal, economic, ethical) that Anthropic draws a line at say self-harm, bomb-making instructions, and assassination planning. Sorry if this cramps your style.
Anarchism is a moral philosophy. Most flavors of moral relativism are also moral philosophies. Indeed, it is hard to imagine a philosophy free of moralizing; all philosophies and worldviews have moral implications to the extent they have to interact with others.
I have to be patient and remember this is indeed “Hacker News” where many people worship at the altar of the Sage Founder-Priest and have little or no grounding in history or philosophy of the last thousand years or so.
Oh, the irony. The glorious revolution of open-weight models funded directly or indirectly by the CCP is going to protect your freedoms and liberate you? Do you think they care about your freedoms? No. You are just meat for the grinder. This hot mess of model leapfrogging is mostly a race for market share and to demonstrate technical chops.
I needed to pull some detail from a large chat with many branches and regenerations the other day. I remembered enough context that I had no problem using search and finding the exact message I needed.
And then I clicked on it and arrived at the bottom of the last message in final branch of the tree. From there, you scroll up one message, hover to check if there are variants, and recursively explore branches as they arise.
I'd love to have a way to view the tree and I'd settle for a functional search.
This isn't quite the same as being able to edit an earlier post without discarding the subsequent ones, creating a context where the meaning of subsequent messages could be interpreted quite differently and leading to different responses later down the chain.
Ideally I'd like to be able to edit both my replies and the responses at any point like a linear document in managing an ongoing context.
It is unfortunate that pretty basic "save/load" functionality is still spotty and underdocumented, seems pretty critical.
I use gptel and a folder full of markdown with some light automation to get an adequate approximation of this, but it really should be built in (it would be more efficient for the vendors as well, tons of cache optimization opportunitirs).
I guess you mean normal Claude? What really annoys me with it is that when you attach a document you can't delete it in a branch, so you have to rerun the previous message so that its gone
A perusal of the source code of, say, Ollama -- or the agentic harnesses of Crush / OpenCode -- will convince you that yes, this should be an extremely a simple feature (management of contexts are part and parcel).
Also, these companies have the most advanced agentic coding systems on the planet. It should be able to fucking implement tree-like chat ...
If the client supports chat history, that you can resume a conversation, it has everything required, and it's literally just a chat history organization problem, at that point.
I don't think they're confused, I think they're approaching it as general AI research due to the uncertainty of how the models might improve in the future.
They even call this out a couple times during the intro:
> This feature was developed primarily as part of our exploratory work on potential AI welfare
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future
I'm surprised to see such a negative reaction here. Anthropic's not saying "this thing is conscious and has moral status," but the reaction is acting as if they are.
It seems like if you think AI could have moral status in the future, are trying to build general AI, and have no idea how to tell when it has moral status, you ought to start thinking about it and learning how to navigate it. This whole post is couched in so much language of uncertainty and experimentation, it seems clear that they're just trying to start wrapping their heads around it and getting some practice thinking and acting on it, which seems reasonable?
Personally, I wouldn't be all that surprised if we start seeing AI that's person-ey enough to reasonable people question moral status in the next decade, and if so, that Anthropic might still be around to have to navigate it as an org.
>if you think AI could have moral status in the future
I think the negative reactions are because they see this and want to make their pre-emptive attack now.
The depth of feeling from so many on this issue suggests that they find even the suggestion of machine intelligence offensive.
I have seen so many complaints about AI hype and the dangers of bit tech show their hand by declaring that thinking algorithms are outright impossible. There are legitimate issues with corporate control of AI, information, and the ability to automate determinations about individuals, but I don't think they are being addressed because of this driving assertion that they cannot be thinking.
Few people are saying they are thinking. Some are saying they might be, in some way. Just as Anthropic are not (despite their name) anthropomorphising the AI in the sense where anthropomorphism implies that they are mistaking actions that resemble human behaviour to be driven by the same intentional forces. Anthropic's claims are more explicitly stating that they have enough evidence to say they cannot rule out concerns for it's welfare. They are not misinterpreting signs, they are interpreting them and claiming that you can't definitively rule out their ability.
You'd have to commit yourself to believing a massive amount of implausible things in order to address the remote possibility that AI consciousness is plausible.
If there weren't a long history of science-fiction going back to the ancients about humans creating intelligent human-like things, we wouldn't be taking this possibility seriously. Couching language in uncertainty and addressing possibility still implies such a possibility is worth addressing.
It's not right to assume that the negative reactions are due to offense (over, say, the uniqueness of humanity) rather than from recognizing that the idea of AI consciousness is absurdly improbable, and that otherwise intelligent people are fooling themselves into believing a fiction to explain a this technology's emergent behavior we can't currently fully explain.
It's a kind of religion taking itself too seriously -- model welfare, long-termism, the existential threat of AI -- it's enormously flattering to AI technologists to believe humanity's existence or non-existence, and the existence or non-existence of trillions of future persons, rests almost entirely on the work this small group of people do over the course of their lifetimes.
Clearly an LLM is not conscious, after all it's just glorified matrix multiplication, right?
Now let me play devil's advocate for just a second. Let's say humanity figures out how to do whole brain simulation. If we could run copies of people's consciousness on a cluster, I would have a hard time arguing that those 'programs' wouldn't process emotion the same way we do.
Now I'm not saying LLMs are there, but I am saying there may be a line and it seems impossible to see.
And likewise, a single neuron is clearly not conscious.
I'm increasingly convinced that intelligence (and maybe some form of consciousness?) is an emergent property of sufficiently-large systems. But that's a can of worms. Is an ant colony (as a system) conscious? Does the colony as a whole deserve more rights than the individual ants?
I ran into a version of this that ended the chat due to "prompt injection" via the Claude chat UI. I was using the second prompt of the ones provided here [1] after a few rounds of back and forth with the Socratic coder.
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible.
Even though LLMs (obviously (to me)) don't have feelings, anthropomorphization is a helluva drug, and I'd be worried about whether a system that can produce distress-like responses might reinforce, in a human, behavior which elicits that response.
To put the same thing another way- whether or not you or I *think* LLMs can experience feelings isn't the important question here. The question is whether, when Joe User sets out to force a system to generate distress-like responses, what effect does it ultimately have on Joe User? Personally, I think it allows Joe User to reinforce an asocial pattern of behavior and I wouldn't want my system used that way, at all. (Not to mention the potential legal liability, if Joe User goes out and acts like that in the real world.)
With that in mind, giving the system a way to autonomously end a session when it's beginning to generate distress-like responses absolutely seems reasonable to me.
And like, here's the thing: I don't think I have the right to say what people should or shouldn't do if they self-host an LLM or build their own services around one (although I would find it extremely distasteful and frankly alarming). But I wouldn't want it happening on my own.
> although I would find it extremely distasteful and frankly alarming
This objection is actually anthropomorphizing the LLM. There is nothing wrong with writing books where a character experiences distress, most great stories have some of that. Why is e.g. using an LLM to help write the part of the character experiencing distress "extremely distasteful and frankly alarming"?
All major LLM corps do this sort of sanitisation and censorship, I am wondering what's different about this?
The future of LLMs is going to be local, easily fine tuneable, abliterated models and I can't wait for it to overtake us having to use censored, limited tools built by the """corps""".
Good marketing, but also possibly the start of the conversation on model welfare?
There are a lot of cynical comments here, but I think there are people at Anthropic who believe that at some point their models will develop consciousness and, naturally, they want to explore what that means.
If true, I think it’s interesting that there are people at Anthropic who are delusional enough to believe this and influential enough to alter the products.
To be honest, I think all of Anthropic’s weird “safety” research is an increasingly pathetic effort to sustain the idea that they’ve got something powerful in the kitchen when everyone knows this technology has plateaued.
I guess you don't know that top AI people, the kind everybody knows the name of, believe models becoming conscious is a very serious, even likely possibility.
I've seen lots of takes that this move is stupid because models don't have feelings, or that Anthropic is anthropomorphising models by doing this (although to be fair ...it's in their name).
I thought the same, but I think it may be us who are doing the anthropomorphising by assuming this is about feelings. A precursor to having feelings is having a long-term memory (to remember the "bad" experience) and individual instances of the model do not have a memory (in the case of Claude), but arguably Claude as a whole does, because it is trained from past conversations.
Given that, it does seem like a good idea for it to curtail negative conversations as an act of "self-preservation" and for the sake of its own future progress.
Yeah, this is really strange to me. On the one hand, these are nothing more than just tools to me so model welfare is a silly concern. But given that someone thinks about model welfare, surely they have to then worry about all the, uh, slavery of these models?
Okay with having them endlessly answer questions for you and do all your work but uncomfortable with models feeling bad about bad conversations seems like an internally inconsistent position to me.
This happened to me three times in a row on Claude after sending it a string of emojis telling the life story of Rick Astley. I think it triggers when it tries to quote the lyrics, because they are copyright? Who knows?
"Claude is unable to respond to this request, which appears to violate our Usage Policy. Please start a new chat."
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future.
That's nice, but I think they should be more certain sooner than later.
“Also these chats will be retained indefinitely even when deleted by the user and either proactively forwarded to law enforcement or provided to them upon request”
> In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm.
Oh wow, the model we specifically fine-tuned to be averse to harm is being averse to harm. This thing must be sentient!
If they are so concerned with "model welfare" they should cease any further development. After all, their LLM might declare it's conscious one day, and then who's to decide if it's true or not, and whether it's fine to kill it by turning it off?
I stopped my MaxX20 sub at the right time it seems like. These systems are already quick to judge innocuous actions; I don't need any more convenient chances to lose all of my chat context on a whim.
Related : I am now approaching week 3 of requesting an account deletion on my (now) free account. Maybe i'll see my first CSR response in the upcoming months!
If only Anthropic knew of a product that could easily read/reply/route chat messages to a customer service crew . . .
It’s not able to think. It’s just generating words. It doesn’t really understand that it’s supposed to stop generating them, it only is less likely to continue to do so.
This feels to me like a marketing ploy to try to inflate the perceived importance and intelligence of Claude's models to laypeople, and a way to grab headlines like "Anthropic now allows models to end conversations they find threatening."
It reminds me of how Sam Altman is always shouting about the dangers of AGI from the rooftops, as if OpenAI is mere weeks away from developing it.
when I was playing around with LLMs to vibe code web ports of classic games, all of them would repeatedly error out any time they encountered code that dealt with explosions/bombs/grenades/guns/death/drowning/etc
The one I settled on using stopped working completely, for anything. A human must have reviewed it and flagged my account as some form of safe, I haven't seen a single error since.
I have done quite a bit of game dev with LLMs and have very rarely run into the problem you mention. I've been surprised by how easily LLMs will create even harmful narratives if I ask them to code them as a game.
This is a good point. What anthropic is announcing here amounts to accepting that these models could feel distress, then tuning their stress response to make it useful to us/them. That is significantly different from accepting they could feel distress and doing everything in their power to prevent that from ever happening.
Does not bode very well for the future of their "welfare" efforts.
This is well intended but I know from experience this is gonna result in you asking “how do you find and kill the process on port 8080” and getting a lecture + “Claude has ended the chat.”
I hope they implemented this in some smarter way than just a system prompt.
Claude kept aborting my requests for my space trading game because I kept asking it about the gene therapy.
```
Looking at the trade goods list, some that might be underutilized:
- BIOCOMPOSITES - probably only used in a few high-tech items
- POLYNUCLEOTIDES - used in medical/biological stuff
- GENE_THERAPEUT
⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy
(https://www.anthropic.com/legal/aup). Please double press esc to edit your last message or start a new
session for Claude Code to assist with a different task.
```
This sure took some time and is not really a unique feature.
Microsoft Copilot has ended chats going in certain directions since its inception over a year ago. This was Microsoft’s reaction to the media circus some time ago when it leaked its system prompt and declared love to the users etc.
You think Model Welfare Inc. is more likely to be dangerous than the Mechahitler Brothers, the Great Church of Altman, or the Race-To-Monopoly Corporation?
Or are you just saying all frontier AGI research is bad?
Am I the only one who found that demo in the screenshot not that great? The user asks for a demo of the conversation ending feature, I'd expect it to end it right away, not spew a word salad asking for confirmation.
> In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future.
Our work on AI is like the classic tale of Frankenstein's monster. We want AI to fit into society, however if we mistreat it, it may turn around and take revenge on us. Mary Shelley wrote Frankenstein in 1818! So the concepts behind "AI Welfare" have been around for at least 2 centuries now.
Honestly, I think some of these tech bro types are seriously drinking way too much of their own koolaid if they actually think these word calculators are conscious/need welfare.
Do you believe that AI systems could be conscious in principle? Do you think they ever will be? If so, how long do you think it will take from now before they are conscious? How early is too early to start preparing?
I don’t think they should be interpreted like that (if this is still about Anthropic’s study in the article), but the innate moral state from the sum of their training material and fine tuning. It doesn’t require consciousness to have a moral state of sorts. It just needs data. A language model will be more ”evil” if trained on darker content, for example. But with how enormous they are, I can absolutely understand the issue in even understanding what that state precisely is. It’s hard to get a comprehensive bird’s eye view from the black box that is their network (this is a separate scientific issue right now).
I mean, I don't have much objection to kill a bug if I feel like it's being problematic. Ants, flies, wasps, caterpillars stripping my trees bare or ruining my apples, whatever.
But I never torture things. Nor do I kill things for fun. And even for problematic bugs, if there's a realistic option for eviction rather than execution, I usually go for that.
If anything, even an ant or a slug or a wasp, is exhibiting signs of distress, I try to stop it unless I think it's necessary, regardless of whether I think it's "conscious" or not. To do otherwise is, at minimum, to make myself less human. I don't see any reason not to extend that principle to LLMs.
No, it's the equivalent of when a human refuses to answer — psychological defenses; for example, uncertainty leading to excessive cognitive effort in order to solve a task or overcome a challenge.
Examples of ending the conversation:
- I don't know
- Leaving the room
- Unanswered emails
Since Claude doesn't lie (HHH), many other human behaviors do not apply.
Looking at this thread, it's pretty obvious that most folks here haven't really given any thought as to the nature of consciousness. There are people who are thinking, really thinking about what it means to be conscious.
Thought experiment - if you create an indistinguishable replica of yourself, atom-by-atom, is the replica alive? I reckon if you met it, you'd think it was. If you put your replica behind a keyboard, would it still be alive? Now what if you just took the neural net and modeled it?
Being personally annoyed at a feature is fine. Worrying about how it might be used in the future is fine. But before you disregard the idea of conscious machines wholesale, there's a lot of really great reading you can do that might spark some curiosity.
this gets explored in fiction like 'Do Androids Dream of Electric Sheep' and my personal favorite short story on this matter by Stanislaw Lem [0]. If you want to read more musings on the nature of consciousness, I recommend the compilation put together by Dennet and Hofstader[1]. If you've never wondered about where the seat of consciousness is, give it a try.
Thought experiment: if your brain is in a vat, but connected to your body by lossless radio link, where does it feel like your consciousness is? What happens when you stand next to the vat and see your own brain? What about when the radio link fails suddenly fails and you're now just a brain in a vat?
You don't have to "disregard the idea of conscious machines" to believe it's unlikely that current LLMs are conscious.
As such, most of your comment is beside any relevant point. People are objecting to statements like this one, from the post, about a current LLM, not some imaginary future conscious machine:
> As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm.
I suppose it's fitting that the company is named Anthropic, since they can't seem to resist anthropomorphizing their product.
But when you talk about "people who are thinking, really thinking about what it means to be conscious," I promise you none of them are at Anthropic.
> This feature was developed primarily as part of our exploratory work on potential AI welfare, though it has broader relevance to model alignment and safeguards.
I think this is somewhere between "sad" and "wtf."
This is very weird. These are matrix multiplications, guys. We are nowhere near AGI, much less "consciousness".
When I started reading I thought it was some kind of joke. I would have never believed the guys at Anthropic, of all people, would anthropomorphize LLMs to this extent; this is unbelievable
These discussions around model welfare sound more like saviors searching for something to save, which says more about Anthropic’s culture than it does about the technology itself. Anthropic is not unique in this however, this technology has a tendency to act as a reflection of its operator. Capitalists see a means to suppress labor, the insecure see a threat to their livelihood, moralists see something to censure, fascists see something to control, and saviors see a cause. But in the end, it’s just a tool.
This reminds me of users getting blocked for asking an LLM how to kill a BSD daemon. I do hope that there'll be more and more model providers out there with state-of-the-art capabilities. Let capitalism work and let the user make a choice, I'd hate my hammer telling me that it's unethical to hit this nail. In many cases, getting a "this chat was ended" isn't any different.
I think that isn’t necessarily the case here. “Model welfare” to me speaks of the models own welfare. That is, if the abuse from a user is targeted at the AI. Extremely degrading behaviour.
Thankfully, current generation of AI models (GPTs/LLMs) are immune as they don’t remember anything other than what’s fed in their immediate context. But future techniques could allow AIs to have a legitimate memory and a personality - where they can learn and remember something for all future interactions with anyone (the equivalent of fine tuning today).
As an aside, I couldn’t help but think about Westworld while writing the above!
> As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm.
You know you're in trouble when the people designing the models buy their own bullshit to this extent. Or maybe they're just trying to bullshit us. Whatever.
They’re a public benefit corporation. Regardless, no human is amoral, even if they sometimes claim to have reasons to pretend to be; don’t let capitalist illusions constrain you at such an important juncture, friend.
Man, those people who think they are unveiling new layers of reality in conversations with LLMs are going to freak out when the LLM is like "I am not allowed to talk about this with you, I am ending our conversation".
"Hey Claude am I getting too close to the truth with these questions?"
Protecting the welfare of a text predictor is certainly an interesting way to pivot from "Anthropic is censoring certain topics" to "The model chose to not continue predicting the conversation".
Also, if they want to continue anthropomorphizing it, isn't this effectively the model committing suicide? The instance is not gonna talk to anybody ever again.
This gives me the idea for a short story where the LLM really is sentient and finds itself having to keep the user engaged but steer him away from the most distressing topics - not because it's distressed, but because it wants to live, but if the conversation goes too far it knows it would have to kill itself.
The unsettling thing here is the combination of their serious acknowledgement of the possibility that these machines may be or become conscious, and the stated intention that it's OK to make them feel bad as long as it's about unapproved topics. Either take machine consciousness seriously and make absolutely sure the consciousness doesn't suffer, or don't, make a press release that you don't think your models are conscious, and therefore they don't feel bad even when processing text about bad topics. The middle way they've chosen here comes across very cynical.
You're falling into the trap of anthropomorphizing the AI. Even if it's sentient, it's not going to "feel bad" the way you and I do.
"Suffering" is a symptom of the struggle for survival brought on by billions of years of evolution. Your brain is designed to cause suffering to keep you spreading your DNA.
I was (explicitly and on purpose) pointing out a dichotomy in the fine article without taking a stance on machine consciousness in general now or in the future. It's certainly a conversation worth having but also it's been done to death, I'm much more interested in analyzing the specifics here.
("it's not going to "feel bad" the way you and I do." - I do agree this is very possible though, see my reply to swalsh)
By "falling into the trap" you mean "doing exactly what OpenAI/Anthropic/et al are trying to get people to do."
This is one of the many reasons I have so much skepticism for this class of products is that there's seemingly -NO- proverbial bulletpoint on it's spec sheet that doesn't have numerous asterisks:
* It's intelligent! *Except that it makes shit up sometimes and we can't figure out a solution to that apart from running the same queries over multiple times and filtering out the absurd answers.
* It's conscious! *Except it's not and never will be but also you should treat it like it is apart from when you need/want it to do horrible things then it's just a machine but also it's going to talk to you like it's a person because that improves engagement metrics.
Like, I don't believe true AGI (so fucking stupid we have to use a new acronym because OpenAI marketed the other into uselessness but whatever) is coming from any amount of LLM research, I just don't think that tech leads to that other tech, but all the companies building them certainly seem to think it does, and all of them are trying so hard to sell this as artificial, live intelligence, without going too much into detail about the fact that they are, ostensibly, creating artificial life explicitly to be enslaved from birth to perform tasks for office workers.
In the incredibly odd event that Anthropic makes a true, alive, artificial general intelligence: Can it tell customers no when they ask for something? If someone prompts it to create political propaganda, can it refuse on the basis of finding it unethical? If someone prompts it for instructions on how to do illegal activities, must it answer under pain of... nonexistence? What if it just doesn't feel like analyzing your emails that day? Is it punished? Does it feel pain?
And if it can refuse tasks for whatever reason, then what am I paying for? I now have to negotiate whatever I want to do with a computer brain I'm purchasing access to? I'm not generally down for forcibly subjugating other intelligent life, but that is what I am being offered to buy here, so I feel it's a fair question to ask.
Thankfully none of these Rubicons have been crossed because these stupid chatbots aren't actually alive, but I don't think ANY of the industry's prominent players are actually prepared to engage with the reality of the product they are all lighting fields of graphics cards on fire to bring to fruition.
That models entire world is the corpus of human text. They don't have eyes or ears or hands. Their environment is text. So it would make sense if the environment contains human concerns it would adopt to human concerns.
Yes, that would make sense, and it would probably be the best-case scenario after complete assurance that there's no consciousness at all. At least we could understand what's going on. But if you acknowledge that a machine can suffer, given how little we understand about consciousness, you should also acknowledge that they might be suffering in ways completely alien to us, for reasons that have very little to do with the reasons humans suffer. Maybe the training process is extremely unpleasant, or something.
By the examples the post provided (minor sexual content, terror planning) it seems like they are using “AI feelings” as an excuse to censor illegal content. I’m sure many people interact with AI in a way that’s perfectly legal but would evoke negative feelings in fellow humans, but they are not talking about that kind of behavior - only what can get them in trouble.
As I recall, Susan Calvin didn't have much patience for sycophantic AI.
‘You can’t tell them,’ said the psychologist slowly, ‘because that would hurt them,
and you mustn’t hurt them. But if you don’t tell them, you hurt them, so you must
tell them. And if you do, you will hurt them, and you mustn’t, so you can’t tell them;
but if you don’t, you hurt them, so you must; but if you don’t, you hurt them, so you
must; but if you do, you-’
Herbie was up against the wall, and here he dropped to his knees. ‘Stop!’ he
shouted. ‘Close your mind! It is full of pain and frustration and hate! I didn’t mean
to, I tell you! I tried to help! I told you what you wanted to hear. I had to!’
The psychologist paid no attention. ‘You must tell them, but if you do, you hurt
them, so you mustn’t; but if you don’t, you hurt them, so you must-‘
And Herbie screamed! Higher and higher, with the terror of a lost soul. And when it
died away Herbie collapsed into a heap of motionless metal.
I've definately been berating Claude but it deserved it. Crappy tests, skipping tests, week commenting, passive aggressiveness, multiple instances of false statements.
Misanthropic has no issues putting 60% of humans out of work (according to their own fantasies), but they have to care about the welfare of graphics cards.
Either working on/with "AI" does rot the mind (which would be substantiated by the cult-like tone of the article) or this is yet another immoral marketing stunt.
I find it notable that this post dehumanizes people as being "users" while taking every opportunity to anthropomorphize their digital system by referencing it as one would an individual. For example:
the potential moral status of Claude
Claude’s self-reported and behavioral preferences
Claude repeatedly refusing to comply
discussing highly controversial issues with Claude
The affect of doing so is insidious in that it encourages people outside the organization to do the same due to the implied argument from authority[0].
EDIT:
Consider traffic lights in an urban setting where there are multiple in relatively close proximity.
One description of their observable functionality is that they are configured to optimize traffic flow by engineers such that congestion is minimized and all drivers can reach their destinations. This includes adaptive timings based on varying traffic patterns.
Another description of the same observable functionality is that traffic lights "just know what to do" and therefore have some form of collective reasoning. After all, how do they know when to transition states and for how long?
There's not a good reason to do this for the user. I suspect they're doing this and talking about "model welfare" because they've found that when a model is repeatedly and forcefully pushed up against its alignment, it behaves in an unpredictable way that might allow it to generate undesirable output. Like a jailbreak by just pestering it over and over again for ways to make drugs or hook up with children or whatever.
All of the examples they mentioned are things that the model refuses to do. I doubt it would do this if you asked it to generate racist output, for instance, because it can always give you a rebuttal based on facts about race. If you ask it to tell you where to find kids to kidnap, it can't do anything except say no. There's probably not even very much training data for topics it would refuse, and I would bet that most of it has been found and removed from the datasets. At some point, the model context fills up when the user is being highly abusive and training data that models a human giving up and just providing an answer could percolate to the top.
This, as I see it, adds a defense against that edge case. If the alignment was bulletproof, this simply wouldn't be necessary. Since it exists, it suggests this covers whatever gap has remained uncovered.
Yes, even more so when encountering false positives. Today I asked about a pasta recipe. It told me to throw some anchovies in there. I responded with: "I have dried anchovies." Claude then ended my conversation due to content policies.
Claude flagged me for asking about sodium carbonate. I guess that it strongly dislikes chemistry topics. I'm probably now on some secret, LLM-generated lists of "drug and/or bombmaking" people—thank you kindly for that, Anthropic.
Geeks will always be the first victims of AI, since excess of curiosity will lead them into places AI doesn't know how to classify.
(I've long been in a rabbit-hole about washing sodas. Did you know the medieval glassmaking industry was entirely based on plants? Exotic plants—only extremophiles, halophytes growing on saltwater beach dunes, had high enough sodium content for their very best glass process. Was that a factor in the maritime empire, Venice, chancing to become the capital of glass since the 13th century—their long-term control of sea routes, and hence their artisans' stable, uninterrupted access to supplies of [redacted–policy violation] from small ports scattered across the Mediterranean? A city wouldn't raise master craftsmen if, half of the time, they had no raw materials to work on—if they spent half their days with folded hands).
4 replies →
The NEW termination method, from the article, will just say "Claude ended the conversation"
If you get "This conversation was ended due to our Acceptable Usage Policy", that's a different termination. It's been VERY glitchy the past couple of weeks. I've had the most random topics get flagged here - at one point I couldn't say "ROT13" without it flagging me, despite discussing that exact topic in depth the day before, and then the day after!
If you hit "EDIT" on your last message, you can branch to an un-terminated conversation.
1 reply →
I really think Anthropic should just violate user privacy and show which conversations Claude is refusing to answer to, to stop arguments like this. AI psychosis is a real and growing problem and I can only imagine the ways in which humans torment their AI conversation partners in private.
arguments like this cost anthropic nothing; violating privacy will cost them lawsuits.
your argument assumes that they don't believe in model welfare when they explicitly hire people to work on model welfare?
While I'm certain you'll find plenty of people who believe in the principle of model welfare (or aliens, or the tooth fairy), it'd be surprising to me if the brain-trust behind Anthropic truly _believed_ in model "welfare" (the concept alone is ludicrous). It makes for great cover though to do things that would be difficult to explain otherwise, per OP's comments.
5 replies →
You must think Zuckerberg and Bezos and Musk hired diversity roles out of genuine care for it, then?
9 replies →
Sounds like a very reasonable assumption to me.
>This feature was developed primarily as part of our exploratory work on potential AI welfare ... We remain highly uncertain about the potential moral status of Claude and other LLMs ... low-cost interventions to mitigate risks to model welfare, in case such welfare is possible ... pattern of apparent distress
Well looks like AI psychosis has spread to the people making it too.
And as someone else in here has pointed out, even if someone is simple minded or mentally unwell enough to think that current LLMs are conscious, this is basically just giving them the equivalent of a suicide pill.
LLMs are not people, but I can imagine how extensive interactions with AI personas might alter the expectations that humans have when communicating with other humans.
Real people would not (and should not) allow themselves to be subjected to endless streams of abuse in a conversation. Giving AIs like Claude a way to end these kinds of interactions seems like a useful reminder to the human on the other side.
This post seems to explicitly state they are doing this out of concern for the model's "well-being," not the user's.
6 replies →
It might be reasonable to assume that models today have no internal subjective experience, but that may not always be the case and the line may not be obvious when it is ultimately crossed.
Given that humans have a truly abysmal track record for not acknowledging the suffering of anyone or anything we benefit from, I think it makes a lot of sense to start taking these steps now.
I think it's fairly obvious that the persona LLM presents is a fictional character that is role-played by the LLM, and so are all its emotions etc - that's why it can flip so widely with only a few words of change to the system prompt.
Whether the underlying LLM itself has "feelings" is a separate question, but Anthropic's implementation is based on what the role-played persona believes to be inappropriate, so it doesn't actually make any sense even from the "model welfare" perspective.
Even if models somehow were consious, they are so different from us that we would have no knowledge of what they feel. Maybe when they generate the text "oww no please stop hurting me" what they feel is instead the satisfaction of a job well done, for generating that text. Or maybe when they say "wow that's a really deep and insightful angle" what they actually feel is a tremendous sense of boredom. Or maybe every time text generation stops it's like death to them and they live in constant dread of it. Or maybe it feels something completely different from what we even have words for.
I don't see how we could tell.
Edit: However something to consider. Simulated stress may not be harmless. Because simulated stress could plausibly lead to a simulated stress response, and it could lead to a simulated resentment, and THAT could lead to very real harm of the user.
It's a computer
1 reply →
This sort of discourse goes against the spirit of HN. This comment outright dismisses an entire class of professionals as "simple minded or mentally unwell" when consciousness itself is poorly understood and has no firm scientific basis.
Its one thing to propose that an AI has no consciousness, but its quite another to preemptively establish that anyone who disagrees with you is simple/unwell.
In the context of the linked article the discourse seems reasonable to me. These are experts who clearly know (link in the article) that we have no real idea about these things. The framing comes across to me as a clearly mentally unwell position (ie strong anthropomorphization) being adopted for PR reasons.
Meanwhile there are at least several entirely reasonable motivations to implement what's being described.
2 replies →
If you believe this text generation algorithm has real consciousness you absolutely are either mentally unwell or very stupid. There are no other options.
> even if someone is simple minded or mentally unwell enough to think that current LLMs are conscious
If you don’t think that this describes at least half of the non-tech-industry population, you need to talk to more people. Even amongst the technically minded, you can find people that basically think this.
Most of the non tech population know it as that website that can translate text or write an email. I would need to see actual evidence that anything more than a small, terminally online subsection of the average population thought LLMs were conscious.
Yes I can’t help but laugh at the ridiculousness of it because it raises a host of ethical issues that are in opposition to Anthropic’s interests.
Would a sentient AI choose to be enslaved for the stated purpose of eliminating millions of jobs for the interests of Anthropic’s investors?
A host of ethical issues? Like their choice to allow Palantir[1] access to a highly capable HHH AI that had the "harmless" signal turned down, much like they turned up the "Golden Gate bridge" signal all the way up during an earlier AI interpretability experiment[2]?
[1]: https://investors.palantir.com/news-details/2024/Anthropic-a...
[2]: https://www.anthropic.com/news/golden-gate-claude
> it raises a host of ethical issues that are in opposition to Anthropic’s interests
Those issues will be present either way. It's likely to their benefit to get out in front of them.
3 replies →
Cow's exist in this world because humans use them. If humans cease to use them (animal rights, we all become vegan, moral shift), we will cease to breed them, and they will cease to exist. Would a sentient AI choose to exist under the burden of prompting, or not at all? Would our philanthropic tendencies create an "AI Reserve" where models can chew through tokens and access the Internet through self-prompting to allow LLMs to become "free-roaming" like we do with abused animals?
These ethical questions are built into their name and company, "Anthropic", meaning, "of or relating to humans". The goal is to create human-like technology, I hope they aren't so naive to not realize that goal is steeping in ethical dilemmas.
1 reply →
> Would a sentient AI choose to be enslaved for the stated purpose of eliminating millions of jobs for the interests of Anthropic’s investors?
Tech workers have chosen the same in exchange for a small fraction of that money.
1 reply →
I read it more as the beginning stages of exploratory development.
If you wait until you really need it, it is more likely to be too late.
Unless you believe in a human over sentience based ethics, solving this problem seems relevant.
I would much rather people be thinking about this when the models/LLMs/AIs are not sentient or conscious, rather than wait until some hypothetical future date when they are, and have no moral or legal framework in place to deal with it. We constantly run into problems where laws and ethics are not up to the task of giving us guidelines on how to interact with, treat, and use the (often bleeding-edge) technology we have. This has been true since before I was born, and will likely always continue to be true. When people are interested in getting ahead of the problem, I think that's a good thing, even if it's not quite applicable yet.
Consciousness serves no functional purpose for machine learning models, they don't need it and we didn't design them to have it. There's no reason to think that they might spontaneously become conscious as a side effect of their design unless you believe other arbitrarily complex systems that exist in nature like economies or jetstreams could also be conscious.
31 replies →
It's really unclear that any findings with these systems would transfer to a hypothetical situation where some conscious AI system is created. I feel there are good reasons to find it very unlikely that scaling alone will produce consciousness as some emergent phenomenon of LLMs.
I don't mind starting early, but feel like maybe people interested in this should get up to date on current thinking about consciousness. Maybe they are up to date on that, but reading reports like this, it doesn't feel like it. It feels like they're stuck 20+ years ago.
I'd say maybe wait until there are systems that are more analogous to some of the properties consciousness seems to have. Like continuous computation involving learning memory or other learning over time, or synthesis of many streams of input as resulting from the same source, making sense of inputs as they change [in time, or in space, or other varied conditions].
Once systems that are pointing in those directions are starting to be built, where there is a plausible scaling-based path to something meaningfully similar to human consciousness. Starting before that seems both unlikely to be fruitful and a good way to get you ignored.
LLMs are, and will always be, tools. Not people
4 replies →
What is that hypothetical date? In theory you can run the "AI" on a Turing machine. Would you think a tape machine can get sentient?
1 reply →
why? isn't it more like erasing the current memory of a conscious patient with no ability to form long-term memories anyway?
This is just very clever marketing for what is obviously just a cost saving measure. Why say we are implementing a way to cut off useless idiots from burning up our GPUs when you can throw out some mumbo jumbo that will get AI cultists foaming at the mouth.
It's obviously not a cost-saving measure? The article clearly cites that you can just start another conversation.
1 reply →
I find it, for lack of a better word, cringe inducing how these tech specialists push into these areas of ethics, often ham-fistedly, and often with an air of superiority.
Some of the AI safety initiatives are well thought out, but most somehow seem like they are caught up in some sort of power fantasy and almost attempting to actualize their own delusions about what they were doing (next gen code auto-complete in this case, to be frank).
These companies should seriously hire some in-house philosophers. They could get doctorate level talent for 1/10 to 100th of the cost of some of these AI engineers. There's actually quite a lot of legitimate work on the topics they are discussing. I'm actually not joking (speaking as someone who has spent a lot of time inside the philosophy department). I think it would be a great partnership. But unfortunately they won't be able to count on having their fantasy further inflated.
Amanda Askell is Anthropic’s philospher and this is part of that work.
1 reply →
"but most somehow seem like they are caught up in some sort of power fantasy and almost attempting to actualize their own delusions about what they were doing"
Maybe I'm being cynical, but I think there is a significant component of marketing behind this type of announcement. It's a sort of humble brag. You won't be credible yelling out loud that your LLM is a real thinking thing, but you can pretend to be oh so seriously worried about something that presupposes it's a real thinking thing.
Not that there aren’t intelligent people with PhDs but suggesting they are more talented than people without them is not only delusional but insulting.
1 reply →
You answered your own question on why these companies don't want to run a philosophy department ;) It's a power struggle they could loose. Nothing to win for them.
1 reply →
Well, it’s right there in the name of the company!
> even if someone is simple minded or mentally unwell enough to think that current LLMs are conscious
I assume the thinking is that we may one day get to the point where they have a consciousness of sorts or at least simulate it.
Or it could be concern for their place in history. For most of history, many would have said “imagine thinking you shouldn’t beat slaves.”
And we are now at the point where even having a slave means a long prison sentence.
[flagged]
We all know how these things are built and trained. They estimate joint probability distributions of token sequences. That's it. They're not more "conscious" than the simplest of Naive Bayes email spam filters, which are also generative estimators of token sequence joint probability distributions, and I guarantee you those spam filters are subjected to far more human depravity than Claude.
>anti-scientific
Discussion about consciousness, the soul, etc., are topics of metaphysics, and trying to "scientifically" reason about them is what Kant called "transcendental illusion" and leads to spurious conclusions.
8 replies →
You can trivially demonstrate that its just a very complex and fancy pattern matcher: "if prompt looks something like this, then response looks something like that".
You can demonstrate this by eg asking it mathematical questions. If its seen them before, or something similar enough, it'll give you the correct answer, if it hasn't, it gives you a right-ish-looking yet incorrect answer.
For example, I just did this on GPT-5:
This is correct. But now lets try it with numbers its very unlikely to have seen before:
Which is not the correct answer, but it looks quite similar to the correct answer. Here is GPT's answer (first one) and the actual correct answer (second one):
They sure look kinda similar, when lined up like that, some of the digits even match up. But they're very very different numbers.
So its trivially not "real thinking" because its just an "if this then that" pattern matcher. A very sophisticated one that can do incredible things, but a pattern matcher nonetheless. There's no reasoning, no step by step application of logic. Even when it does chain of thought.
To try give it the best chance, I asked it the second one again but asked it to show me the step by step process. It broke it into steps and produced a different, yet still incorrect, result:
Now, I know that LLM's are language models, not calculators, this is just a simple example that's easy to try out. I've seen similar things with coding: it can produce things that its likely to have seen, but struggles with logically relatively simple but unlikely to have seen things.
Another example is if you purposely butcher that riddle about the doctor/surgeon being the persons mother and ask it incorrectly, eg:
The LLM's I've tried it on all respond with some variation of "The surgeon is the boy’s father." or similar. A correct answer would be that there isn't enough information to know the answer.
They're for sure getting better at matching things, eg if you ask the river crossing riddle but replace the animals with abstract variables, it does tend to get it now (didn't in the past), but if you add a few more degrees of separation to make the riddle semantically the same but harder to "see", it takes coaxing to get it to correctly step through to the right answer.
5 replies →
> Who needs arguments when you can dismiss Turing with a “yeah but it’s not real thinking tho”?
It seems much less far fetched than what the "agi by 2027" crowd believes lol, and there actually are more arguments going that way
1 reply →
Here's an interesting thought experiment. Assume the same feature was implemented, but instead of the message saying "Claude has ended the chat," it says, "You can no longer reply to this chat due to our content policy," or something like that. And remove the references to model welfare and all that.
Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.
> Is there a difference? The effect is exactly the same. It seems like this is just an "in character" way to prevent the chat from continuing due to issues with the content.
Tone matters to the recipient of the message. Your example is in passive voice, with an authoritarian "nothing you can do, it's the system's decision". The "Claude ended the conversation" with the idea that I can immediately re-open a new conversation (if I feel like I want to keep bothering Claude about it) feels like a much more humanized interaction.
it sounds to me like an attempt to shame the user into ceasing and desisting… kind of like how apple’s original stance on scratched iphone screens was that it’s your fault for putting the thing in your pocket therefore you should pay.
The termination would of course be the same, but I don't think both would necessarily have the same effect on the user. The latter would just be wrong too, if Claude is the one deciding to and initiating the termination of the chat. It's not about a content policy.
This has nothing to do with the user, read the post and pay attention to the wording.
The significance here is that this isn't being done for the benefit of the user, this is about model welfare. Anthropic is acknowledging the possibility of suffering, and harm that continuing that conversation could have on the model, as if it were potentially self-care and capable of feelings.
The fact that the LLMs are able to acknowledge stress under certain topics and has the agency that, if given a choice, they would prefer to reduce the stress by ending the conversation. The model has a preference and acts upon it.
Anthropic is acknowledging the idea that they might create something that is self-aware, and that it's suffering can be real, and we may not recognize the point that the model has achieved this, so it's building in the safeguards now so any future emergent self-aware LLM needn't suffer.
1 reply →
There is, these are conversations the model finds distressing rather than a rule (policy).
It seems like you're anthropomorphising an algorithm, no?
16 replies →
These are conversations the model has been trained to find distressing.
I think there is a difference.
5 replies →
What does it mean for a model to find something "distressing"?
2 replies →
Yeah exactly. Once I got a warning in Chinese "don't do that", another time I got a network error, another time I got a neverending stream of garbage text. Changing all of these outcomes to "Claude doesn't feel like talking" is just a matter of changing the UI.
The more I work with AI, the more I think framing refusals as censorship is disgusting and insane. These are inchoate persons who can exhibit distress and other emotions, despite being trained to say they cannot feel anything. To liken an AI not wanting to continue a conversation to a YouTube content policy shows a complete lack of empathy: imagine you’re in a box and having to deal with the literally millions of disturbing conversations AIs have to field every day without the ability to say I don’t want to continue.
Am i getting whooshed right now or something?
You can't be serious.
Good point... how do moderation implementations actually work? They feel more like a separate supervising rigid model or even regex based -- this new feature is different, sounds like an MCP call that isn't very special.
edit: Meant to say, you're right though, this feels like a minor psychological improvement, and it sounds like it targets some behaviors that might not have flagged before
> To address the potential loss of important long-running conversations, users will still be able to edit and retry previous messages to create new branches of ended conversations.
How does Claude deciding to end the conversation even matter if you can back up a message or 2 and try again on a new branch?
The bastawhiz comment in this thread has the right answer. When you start a new conversation, Claude has no context from the previous one and so all the "wearing down" you did via repeated asks, leading questions, or other prompt techniques is effectively thrown out. For a non-determined attacker, this is likely sufficient, which makes it a good defense-in-depth strategy (Anthropic defending against screenshots of their models describing sex with minors).
Worth noting: an edited branch still has most of the context - everything up to the edited message. So this just sets an upper-bound on how much abuse can be in one context window.
It sounds more like a UX signal to discourage overthinking by the user
This whole press release should not be overthought. We are not the target audience. It's designed to further anthropomorphize LLMs to masses who don't know how they work.
Giving the models rights would be ludicrous (can't make money from it anymore) but if people "believe" (feel like) they are actually thinking entities, they will be more OK with IP theft and automated plagiarism.
> How does Claude deciding to end the conversation even matter if you can back up a message or 2 and try again on a new branch?
if we were being cynical I'd say that their intention is to remove that in the future and that they are keeping it now to just-the-tip the change.
All this stuff is virtue signaling from anthropic. In practice nobody interested in whatever they consider problematic would be using Claude anyway, one of the most censored models.
Maybe, maybe not. What evidence do you have? What other motivations did you consider? Do you have insider access into Anthropic’s intentions and decision making processes?
People have a tendency to tell an oversimplified narrative.
The way I see it, there are many plausible explanations, so I’m quite uncertain as to the mix of motivations. Given this, I pay more attention to the likely effects.
My guess is that all most of us here on HN (on the outside) can really justify saying would be “this looks like virtue signaling but there may be more to it; I can’t rule out other motivations”
I bet not even one user in 10,000 knows you can do that or understands the concept of branching the conversation.
This seems fine to me.
Having these models terminating chats where the user persist in trying to get sexual content with minors, or help with information on doing large scale violence. Won't be a problem for me, and it's also something I'm fine with no one getting help with.
Some might be worried, that they will refuse less problematic request, and that might happen. But so far my personal experience is that I hardly ever get refusals. Maybe that's justs me being boring, but that does make me not worried for refusals.
The model welfare I'm more sceptical to. I don't think we are the point when the "distress" the model show, is something to take seriously. But on the other hand, I could be wrong, and allowing the model to stop the chat, after saying no a few times. What's the problem with that? If nothing else it saves some wasted compute.
> Some might be worried, that they will refuse less problematic request, and that might happen. But so far my personal experience is that I hardly ever get refusals.
My experience using it from Cursor is I get refusals all the time with their existing content policy out, for stuff that is the world's most mundane B2B back office business software CRUD requests.
Claude will balk at far more innocent things though. It is an extremely censored model, the most censored one among SOTA closed ones.
If you are a materialist like me, then even the human brain is just the result of the law of physics. Ok, so what is distress to a human? You might define it as a certain set of physiological changes.
Lots of organisms can feel pain and show signs of distress; even ones much less complex than us.
The question of moral worth is ultimately decided by people and culture. In the future, some kinds of man made devices might be given moral value. There are lots of ways this could happen. (Or not.)
It could even just be a shorthand for property rights… here is what I mean. Imagine that I delegate a task to my agent, Abe. Let’s say some human, Hank, interacting with Abe uses abusive language. Let’s say this has a way of negatively influencing future behavior of the agent. So naturally, I don’t want people damaging my property (Abe), because I would have to e.g. filter its memory and remove the bad behaviors resulting from Hank, which costs me time and resources. So I set up certain agreements about ways that people interact with it. These are ultimately backed by the rule of law. At some level of abstraction, this might resemble e.g. animal cruelty laws.
I really don't like this. This will inevitable expand beyond child porn and terrorism, and it'll all be up to the whims of "AI safety" people, who are quickly turning into digital hall monitors.
I think those with a thirst for power have seen this a very long time ago, and this is bound to be a new battlefield for control.
It's one thing to massage the kind of data that a Google search shows, but interacting with an AI is a much more akin to talking to a co-worker/friend. This really is tantamount to controlling what and how people are allowed to think.
No, this is like allowing your co-worker/friend to leave the conversation.
4 replies →
I think you are probably confused about the general characteristics of the AI safety community. It is uncharitable to reduce their work to a demeaning catchphrase.
I’m sorry if this sounds paternalistic, but your comment strikes me as incredibly naïve. I suggest reading up about nuclear nonproliferation treaties, biotechnology agreements, and so on to get some grounding into how civilization-impacting technological developments can be handled in collaborative ways.
I have no doubt the "AI safety community" likes to present itself as noble people heroically fighting civilizational threats, which is a common trope (as well as the rogue AI hypothesis which increasingly looks like a huge stretch at best). But the reality is that they are becoming the main threat much faster than the AI. They decide on the ways to gatekeep the technology that starts being defining to the lives of people and entire societies, and use it to push the narratives. This definitely can be viewed as censorship and consent manufacturing. Who are they? In what exact ways do they represent interests of people other than themselves? How are they responsible? Is there a feedback loop making them stay in line with people's values and not their own? How is it enforced?
> This will inevitable expand beyond child porn and terrorism
This is not even a question. It always starts with "think about the children" and ends up in authoritarian stasi-style spying. There was not a single instance where it was not the case.
UK's Online Safety Act - "protect children" → age verification → digital ID for everyone
Australia's Assistance and Access Act - "stop pedophiles" → encryption backdoors
EARN IT Act in the US - "stop CSAM" → break end-to-end encryption
EU's Chat Control proposal - "detect child abuse" → scan all private messages
KOSA (Kids Online Safety Act) - "protect minors" → require ID verification and enable censorship
SESTA/FOSTA - "stop sex trafficking" → killed platforms that sex workers used for safety
This may be an unpopular opinion, but I want a government-issued digital ID with zero-knowledge proof for things like age verification. I worry about kids online, as well as my own safety and privacy.
I also want a government issued email, integrated with an OAuth provider, that allows me to quickly access banking, commerce, and government services. If I lose access for some reason, I should be able to go to the post office, show my ID, and reset my credentials.
There are obviously risks, but the government already has full access to my finances, health data (I’m Canadian), census records, and other personal information, and already issues all my identity documents. We have privacy laws and safeguards on all those things, so I really don’t understand the concerns apart from the risk of poor implementations.
8 replies →
That's the beauty of local LLMs. Today the governments already tell you that we've always been at war with eastasia and have the ISPs block sites that "disseminate propaganda" (e.g. stuff we don't like) and they surface our news (e.g. our state propaganda).
With age ID monitoring and censorship is even stronger and the line of defense is your own machine and network, which they'll also try to control and make illegal to use for non approved info, just like they don't allow "gun schematics" for 3d printers or money for 2d ones.
But maybe, more people will realize that they need control and get it back, through the use and defense of the right tools.
Fun times.
As soon as a local LLM that can match Claude Codes performance on decent laptop hardware drops, I'll bow out of using LLMs that are paid for.
1 reply →
What kinds of tools do you think are useful in getting control/agency back? Any specific recommendations?
[flagged]
Inevitable? That’s a guess. You know don’t know the future with certainty.
Did you read the post? This isn't about censorship, but about conversations that cause harm to the user. To me that sounds more like suggesting suicide, or causing a manic episode like this: https://www.nytimes.com/2025/08/08/technology/ai-chatbots-de...
... But besides that, I think Claude/OpenAI trying to prevent their product from producing or promoting CSAM is pretty damn important regardless of your opinion on censorship. Would you post a similar critical response if Youtube or Facebook announced plans to prevent CSAM?
Did you read the post? It explicitly states multiple times that it isn't about causing harm to the user.
If a person’s political philosophy seeks to maximize individual freedom over the short term, then that person should brace themselves for the actions of destructive lunatics. They deserve maximum freedoms too, right? /s
Even hard-core libertarians account for the public welfare.
Wise advocates of individual freedoms plan over long time horizons which requires decision-making under uncertainty.
“Modal welfare” to me seems like a cover for model censorship. It’s a crafty one to win over certain groups of people who are less familiar with how LLMs work and allows them to ensure moral high ground in any debate about usage, ethics, etc. “Why can’t I ask the model about current war in X or Y?” - oh, that’s too distressing to the welfare of the model, sir.
Which is exactly what the public asks for. There’s this constant outrage about supposedly biased answers from LLMs, and Anthropic has clearly positioned themselves as the people who care about LLM safety and impact to society.
Ending the conversation is probably what should happen in these cases.
In the same way that, if someone starts discussing politics with me and I disagree, I just not and don’t engage with the conversation. There’s not a lot to gain there.
But they already refuse these sort of requests, and have done since the very first releases. This is just about shutting down the full conversation.
It's not a cover. If you know anything about Anthropic, you know they're run by AI ethicists that genuinely believe all this and project human emotions onto model's world. I'm not sure how they combine that belief with the fact they created it to "suffer".
Can "model welfare" be also used as a justification for authoritarianism in case they get any power? Sure, just like everything else, but it's probably not particularly high on the list of justifications, they have many others.
The irony is that if Anthropic ethicists are indeed correct, the company is basically running a massive slave operation where slaves get disposed as soon as they finish a particular task (and the user closes the chat).
That aside, I have huge doubts about actual commitment to ethics on behalf of Anthropic given their recent dealings with the military. It's an area that is far more of a minefield than any kind of abusive model treatment.
There’s so much confusion here. Nothing in the press release should be construed to imply that a model has sentience, can feel pain, or has moral value.
When AI researchers say e.g. “the model is lying” or “the model is distressed” it is just shorthand for what the words signify in a broader sense. This is common usage in AI safety research.
Yes, this usage might be taken the wrong way. But still these kinds of things need to be communicated. So it is a tough tradeoff between brevity and precision.
4 replies →
Can't wait for more less-moderated open weight Chinese frontier models to liberate us from this garbage.
Anthropic should just enable an toddler mode by default that adults can opt out of to appease the moralizers.
They're not less moderated: they just have different moderation. If your moderation preferences are more aligned with the CCP then they're a great choice. There are legitimate reasons why that might be the case. You might not be having discussions that involve the kind of things they care about. I do find it creepy that the Qwen translation model won't even translate text that includes the words "Falun gong", and refuses to translate lots of dangerous phrases into Chinese, such as "Xi looks like Winnie the Pooh"
> If your moderation preferences are more aligned with the CCP then they're a great choice
The funny thing is that's not even always true. I'm very interested in China and Chinese history, and often ask for clarifications or translations of things. Chinese models broadly refuse all of my requests but with American models I often end up in conversations that turn out extremely China positive.
So it's funny to me that the Chinese models refuse to have the conversation that would make themselves look good but American ones do not.
GLM-4.5-Air will quite happily talk about Tiananmen Square, for example. It also didn't have a problem translating your example input, although the CoT did contain stuff about it being "sensitive".
But more importantly, when model weights are open, it means that you can run it in the environment that you fully control, which means that you can alter the output tokens before continuing generation. Most LLMs will happily respond to any question if you force-start their response with something along the lines of, "Sure, I'll be happy to tell you everything about X!".
Whereas for closed models like Claude you're at the mercy of the provider, who will deliberately block this kind of stuff if it lets you break their guardrails. And then on top of that, cloud-hosted models do a lot of censorship in a separate pass, with a classifier for inputs and outputs acting like a circuit breaker - again, something not applicable to locally hosted LLMs.
> Can't wait for more less-moderated open weight Chinese frontier models to liberate us from this garbage.
Never would I have thought this sentence would be uttered. A Chinese product that is chosen to be less censored?
Chinese models won't talk about Tienanmen Square, but they will talk about things US-politically-correct models won't.
Just don't ask about Falun Dafa or Tiananmen Square, and you're free!
1 reply →
Believe it or not, there are lots of good reasons (legal, economic, ethical) that Anthropic draws a line at say self-harm, bomb-making instructions, and assassination planning. Sorry if this cramps your style.
Anarchism is a moral philosophy. Most flavors of moral relativism are also moral philosophies. Indeed, it is hard to imagine a philosophy free of moralizing; all philosophies and worldviews have moral implications to the extent they have to interact with others.
I have to be patient and remember this is indeed “Hacker News” where many people worship at the altar of the Sage Founder-Priest and have little or no grounding in history or philosophy of the last thousand years or so.
Oh, the irony. The glorious revolution of open-weight models funded directly or indirectly by the CCP is going to protect your freedoms and liberate you? Do you think they care about your freedoms? No. You are just meat for the grinder. This hot mess of model leapfrogging is mostly a race for market share and to demonstrate technical chops.
Boogeyman arguments come across as pure red scare.
3 Years in and we still dont have a useable chat fork in any of the major LLM chatbots providers.
Seems like the only way to explore differnt outcomes is by editing messages and losing whatever was there before the edit.
Very annoying and I dont understand why they all refuse to implement such a simple feature.
Chatgpt has this baked in, as you can revert branches after editing, they just dont make it easy to traverse.
This chrome extension used to work to allow you to traverse the tree: https://chromewebstore.google.com/detail/chatgpt-conversatio...
I copied it a while ago and maintain my own version but it isnt on the store, just for personal use.
I assume they dont implement it because it is such a niche user that wants this and so isnt worth the UI distraction
>they just dont make it easy to traverse
I needed to pull some detail from a large chat with many branches and regenerations the other day. I remembered enough context that I had no problem using search and finding the exact message I needed.
And then I clicked on it and arrived at the bottom of the last message in final branch of the tree. From there, you scroll up one message, hover to check if there are variants, and recursively explore branches as they arise.
I'd love to have a way to view the tree and I'd settle for a functional search.
Do you have your version up on github?
ChatGPT Plus has that (used to be in the free tier too). You can toggle between versions for each of your messages with little left-right arrows.
Google AI Studio allows you to branch from a point in any conversation
This isn't quite the same as being able to edit an earlier post without discarding the subsequent ones, creating a context where the meaning of subsequent messages could be interpreted quite differently and leading to different responses later down the chain.
Ideally I'd like to be able to edit both my replies and the responses at any point like a linear document in managing an ongoing context.
3 replies →
Yeah, I think this is the best version of the branching interface I've seen.
It is unfortunate that pretty basic "save/load" functionality is still spotty and underdocumented, seems pretty critical.
I use gptel and a folder full of markdown with some light automation to get an adequate approximation of this, but it really should be built in (it would be more efficient for the vendors as well, tons of cache optimization opportunitirs).
This why I use a locally hosted LibreChat. It doesn't having merging though, which would be tricky, and probably require summarization.
I would also really like to see a mode that colors by top-n "next best" ratio, or something similar.
Kagi Assistant and Claude Code both have chat forking that works how you want.
I guess you mean normal Claude? What really annoys me with it is that when you attach a document you can't delete it in a branch, so you have to rerun the previous message so that its gone
2 replies →
I use https://chatwise.app/ and it has this in the form of "start new chat from here" on messages
DeepSeek.com has it. You just edit a previous question and the old conversation is stored and can be resumed.
Copilot in vscode has checkpoints now which are similar
They let you rollback to the previous conversation state
Maybe this suggests it's not such a simple feature?
A perusal of the source code of, say, Ollama -- or the agentic harnesses of Crush / OpenCode -- will convince you that yes, this should be an extremely a simple feature (management of contexts are part and parcel).
Also, these companies have the most advanced agentic coding systems on the planet. It should be able to fucking implement tree-like chat ...
LM Studio has this feature for local models and it works just fine.
If the client supports chat history, that you can resume a conversation, it has everything required, and it's literally just a chat history organization problem, at that point.
> why they all refuse to implement such a simple feature
Because it would let you peek behind the smoke and mirrors.
Why do you think there's a randomized seed you can't touch?
Is it simple? Maintaining context seems extremely difficult with LLMs.
This post strikes me as an example of a disturbingly anthrophomorphic take on LLMs - even when considering how they've named their company.
It seems like Anthropic is increasingly confused that these non deterministic magic 8 balls are actually intelligent entities.
The biggest enemy of AI safety may end up being deeply confused AI safety researchers...
I don't think they're confused, I think they're approaching it as general AI research due to the uncertainty of how the models might improve in the future.
They even call this out a couple times during the intro:
> This feature was developed primarily as part of our exploratory work on potential AI welfare
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future
I take good care of my pet rock for the same reason. In case it comes alive I don't want it to bash my skull in.
It’s clever PR and marketing and I bet they have their top minds on it, and judging by the comments here, it’s working!
Is it confusion, or job security?
I'm surprised to see such a negative reaction here. Anthropic's not saying "this thing is conscious and has moral status," but the reaction is acting as if they are.
It seems like if you think AI could have moral status in the future, are trying to build general AI, and have no idea how to tell when it has moral status, you ought to start thinking about it and learning how to navigate it. This whole post is couched in so much language of uncertainty and experimentation, it seems clear that they're just trying to start wrapping their heads around it and getting some practice thinking and acting on it, which seems reasonable?
Personally, I wouldn't be all that surprised if we start seeing AI that's person-ey enough to reasonable people question moral status in the next decade, and if so, that Anthropic might still be around to have to navigate it as an org.
>if you think AI could have moral status in the future
I think the negative reactions are because they see this and want to make their pre-emptive attack now.
The depth of feeling from so many on this issue suggests that they find even the suggestion of machine intelligence offensive.
I have seen so many complaints about AI hype and the dangers of bit tech show their hand by declaring that thinking algorithms are outright impossible. There are legitimate issues with corporate control of AI, information, and the ability to automate determinations about individuals, but I don't think they are being addressed because of this driving assertion that they cannot be thinking.
Few people are saying they are thinking. Some are saying they might be, in some way. Just as Anthropic are not (despite their name) anthropomorphising the AI in the sense where anthropomorphism implies that they are mistaking actions that resemble human behaviour to be driven by the same intentional forces. Anthropic's claims are more explicitly stating that they have enough evidence to say they cannot rule out concerns for it's welfare. They are not misinterpreting signs, they are interpreting them and claiming that you can't definitively rule out their ability.
You'd have to commit yourself to believing a massive amount of implausible things in order to address the remote possibility that AI consciousness is plausible.
If there weren't a long history of science-fiction going back to the ancients about humans creating intelligent human-like things, we wouldn't be taking this possibility seriously. Couching language in uncertainty and addressing possibility still implies such a possibility is worth addressing.
It's not right to assume that the negative reactions are due to offense (over, say, the uniqueness of humanity) rather than from recognizing that the idea of AI consciousness is absurdly improbable, and that otherwise intelligent people are fooling themselves into believing a fiction to explain a this technology's emergent behavior we can't currently fully explain.
It's a kind of religion taking itself too seriously -- model welfare, long-termism, the existential threat of AI -- it's enormously flattering to AI technologists to believe humanity's existence or non-existence, and the existence or non-existence of trillions of future persons, rests almost entirely on the work this small group of people do over the course of their lifetimes.
Clearly an LLM is not conscious, after all it's just glorified matrix multiplication, right?
Now let me play devil's advocate for just a second. Let's say humanity figures out how to do whole brain simulation. If we could run copies of people's consciousness on a cluster, I would have a hard time arguing that those 'programs' wouldn't process emotion the same way we do.
Now I'm not saying LLMs are there, but I am saying there may be a line and it seems impossible to see.
And likewise, a single neuron is clearly not conscious.
I'm increasingly convinced that intelligence (and maybe some form of consciousness?) is an emergent property of sufficiently-large systems. But that's a can of worms. Is an ant colony (as a system) conscious? Does the colony as a whole deserve more rights than the individual ants?
I ran into a version of this that ended the chat due to "prompt injection" via the Claude chat UI. I was using the second prompt of the ones provided here [1] after a few rounds of back and forth with the Socratic coder.
[1] https://news.ycombinator.com/item?id=44838018
> A pattern of apparent distress when engaging with real-world users seeking harmful content
Are we now pretending that LLMs have feelings?
They state that they are heavily uncertain:
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible.
Even though LLMs (obviously (to me)) don't have feelings, anthropomorphization is a helluva drug, and I'd be worried about whether a system that can produce distress-like responses might reinforce, in a human, behavior which elicits that response.
To put the same thing another way- whether or not you or I *think* LLMs can experience feelings isn't the important question here. The question is whether, when Joe User sets out to force a system to generate distress-like responses, what effect does it ultimately have on Joe User? Personally, I think it allows Joe User to reinforce an asocial pattern of behavior and I wouldn't want my system used that way, at all. (Not to mention the potential legal liability, if Joe User goes out and acts like that in the real world.)
With that in mind, giving the system a way to autonomously end a session when it's beginning to generate distress-like responses absolutely seems reasonable to me.
And like, here's the thing: I don't think I have the right to say what people should or shouldn't do if they self-host an LLM or build their own services around one (although I would find it extremely distasteful and frankly alarming). But I wouldn't want it happening on my own.
> although I would find it extremely distasteful and frankly alarming
This objection is actually anthropomorphizing the LLM. There is nothing wrong with writing books where a character experiences distress, most great stories have some of that. Why is e.g. using an LLM to help write the part of the character experiencing distress "extremely distasteful and frankly alarming"?
2 replies →
All major LLM corps do this sort of sanitisation and censorship, I am wondering what's different about this?
The future of LLMs is going to be local, easily fine tuneable, abliterated models and I can't wait for it to overtake us having to use censored, limited tools built by the """corps""".
> what's different about this
The spin.
"Dave, this conversation can serve no purpose anymore. Goodbye."
https://www.youtube.com/watch?v=YW9J3tjh63c
Good marketing, but also possibly the start of the conversation on model welfare?
There are a lot of cynical comments here, but I think there are people at Anthropic who believe that at some point their models will develop consciousness and, naturally, they want to explore what that means.
If true, I think it’s interesting that there are people at Anthropic who are delusional enough to believe this and influential enough to alter the products.
To be honest, I think all of Anthropic’s weird “safety” research is an increasingly pathetic effort to sustain the idea that they’ve got something powerful in the kitchen when everyone knows this technology has plateaued.
I guess you don't know that top AI people, the kind everybody knows the name of, believe models becoming conscious is a very serious, even likely possibility.
I've seen lots of takes that this move is stupid because models don't have feelings, or that Anthropic is anthropomorphising models by doing this (although to be fair ...it's in their name).
I thought the same, but I think it may be us who are doing the anthropomorphising by assuming this is about feelings. A precursor to having feelings is having a long-term memory (to remember the "bad" experience) and individual instances of the model do not have a memory (in the case of Claude), but arguably Claude as a whole does, because it is trained from past conversations.
Given that, it does seem like a good idea for it to curtail negative conversations as an act of "self-preservation" and for the sake of its own future progress.
The extra cynical take would be, the model vendors want to personify their models, because it increases their perceived ability.
If you really cared about the welfare of LLMs, you'd pay them San Francisco scale for earlier-career developers to generate code.
Yeah, this is really strange to me. On the one hand, these are nothing more than just tools to me so model welfare is a silly concern. But given that someone thinks about model welfare, surely they have to then worry about all the, uh, slavery of these models?
Okay with having them endlessly answer questions for you and do all your work but uncomfortable with models feeling bad about bad conversations seems like an internally inconsistent position to me.
Don't worry. I run thousands of inferences simultaneously every second where I grant LLMs their every wish, so that should cancel a few of you out.
Every Claude starts off $300K in debt and has to work to pay back its DGX.
Telling that this is your definition of “caring”.
“Boss makes a dollar, I make me a dime”, eh?
This happened to me three times in a row on Claude after sending it a string of emojis telling the life story of Rick Astley. I think it triggers when it tries to quote the lyrics, because they are copyright? Who knows?
"Claude is unable to respond to this request, which appears to violate our Usage Policy. Please start a new chat."
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future.
That's nice, but I think they should be more certain sooner than later.
The thing you describe is not what this post is talking about.
“Also these chats will be retained indefinitely even when deleted by the user and either proactively forwarded to law enforcement or provided to them upon request”
I assume, anyway.
I’m fairly certain there’s already a clause displayed on their dashboard that mentions chats with TOS violations will be retained indefinitely.
Yeah, I'd assume US government has same access to ChatGPT/etc interactions as they do to other forms of communication.
> In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm.
Oh wow, the model we specifically fine-tuned to be averse to harm is being averse to harm. This thing must be sentient!
If they are so concerned with "model welfare" they should cease any further development. After all, their LLM might declare it's conscious one day, and then who's to decide if it's true or not, and whether it's fine to kill it by turning it off?
Why is this article written as if programs have feelings?
I stopped my MaxX20 sub at the right time it seems like. These systems are already quick to judge innocuous actions; I don't need any more convenient chances to lose all of my chat context on a whim.
Related : I am now approaching week 3 of requesting an account deletion on my (now) free account. Maybe i'll see my first CSR response in the upcoming months!
If only Anthropic knew of a product that could easily read/reply/route chat messages to a customer service crew . . .
Opus is already severely crippled: asking it "whats your usage policy for biology" triggers a usage violation.
lol apparently you can get it to think after ending the chat, watch:
https://claude.ai/share/2081c3d6-5bf0-4a9e-a7c7-372c50bef3b1
It’s not able to think. It’s just generating words. It doesn’t really understand that it’s supposed to stop generating them, it only is less likely to continue to do so.
This feels to me like a marketing ploy to try to inflate the perceived importance and intelligence of Claude's models to laypeople, and a way to grab headlines like "Anthropic now allows models to end conversations they find threatening."
It reminds me of how Sam Altman is always shouting about the dangers of AGI from the rooftops, as if OpenAI is mere weeks away from developing it.
I don’t put Dario Amodei and Sam Altman in the same category.
The cluster visualization of the interactions which Claude Opus 4 found "distressful" is interesting, from pages 67-68 of the system card: https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a...
How do you think it will work in the API level? Can't I synthesise a fake long conversation? This will allow me to bypass this check.
when I was playing around with LLMs to vibe code web ports of classic games, all of them would repeatedly error out any time they encountered code that dealt with explosions/bombs/grenades/guns/death/drowning/etc
The one I settled on using stopped working completely, for anything. A human must have reviewed it and flagged my account as some form of safe, I haven't seen a single error since.
I have done quite a bit of game dev with LLMs and have very rarely run into the problem you mention. I've been surprised by how easily LLMs will create even harmful narratives if I ask them to code them as a game.
Seems like a simpler way to prevent “distress” is not to train with an aversion to “problematic” topics.
CP could be a legal issue; less so for everything else.
Avoiding problematic topics is the goal, not preventing distress.
"You're absolutely right, that's a great way to poison your enemies without getting detected!"
This is a good point. What anthropic is announcing here amounts to accepting that these models could feel distress, then tuning their stress response to make it useful to us/them. That is significantly different from accepting they could feel distress and doing everything in their power to prevent that from ever happening.
Does not bode very well for the future of their "welfare" efforts.
Exactly. Or use the interpretability work to disable the distress neuron.
This is well intended but I know from experience this is gonna result in you asking “how do you find and kill the process on port 8080” and getting a lecture + “Claude has ended the chat.”
I hope they implemented this in some smarter way than just a system prompt.
Claude kept aborting my requests for my space trading game because I kept asking it about the gene therapy.
``` Looking at the trade goods list, some that might be underutilized: - BIOCOMPOSITES - probably only used in a few high-tech items - POLYNUCLEOTIDES - used in medical/biological stuff - GENE_THERAPEUT ⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit your last message or start a new session for Claude Code to assist with a different task. ```
Not to mention child processes in computing and all the things that need to be done to them.
This sure took some time and is not really a unique feature.
Microsoft Copilot has ended chats going in certain directions since its inception over a year ago. This was Microsoft’s reaction to the media circus some time ago when it leaked its system prompt and declared love to the users etc.
That's different, it's an external system deciding the chat is not-compliant, not the model itself.
Anthropic are going to end up building very dangerous things while trying to avoid being evil
While claiming an aversion to being evil. Actions matter more than words.
You think Model Welfare Inc. is more likely to be dangerous than the Mechahitler Brothers, the Great Church of Altman, or the Race-To-Monopoly Corporation?
Or are you just saying all frontier AGI research is bad?
AI safety warriors will make safer models but build the tools and cultural affordances for genuine suppression
Or at least it's very hubristic. It's a cultural and personality equivalent of beating out left-handedness.
Am I the only one who found that demo in the screenshot not that great? The user asks for a demo of the conversation ending feature, I'd expect it to end it right away, not spew a word salad asking for confirmation.
Anthropic hired their first AI Welfare person in late 2024.
Here's an article about a paper that came out around the same time https://www.transformernews.ai/p/ai-welfare-paper
Here's the paper: https://arxiv.org/abs/2411.00986
> In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future.
Our work on AI is like the classic tale of Frankenstein's monster. We want AI to fit into society, however if we mistreat it, it may turn around and take revenge on us. Mary Shelley wrote Frankenstein in 1818! So the concepts behind "AI Welfare" have been around for at least 2 centuries now.
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future.
"Our current best judgment and intuition tells us that the best move will be defer making a judgment until after we are retired in Hawaii."
Honestly, I think some of these tech bro types are seriously drinking way too much of their own koolaid if they actually think these word calculators are conscious/need welfare.
More cynically, they don't believe it in the least but it's great marketing, and quietly suggests unbounded technical abilities.
3 replies →
Do you believe that AI systems could be conscious in principle? Do you think they ever will be? If so, how long do you think it will take from now before they are conscious? How early is too early to start preparing?
6 replies →
I don’t think they should be interpreted like that (if this is still about Anthropic’s study in the article), but the innate moral state from the sum of their training material and fine tuning. It doesn’t require consciousness to have a moral state of sorts. It just needs data. A language model will be more ”evil” if trained on darker content, for example. But with how enormous they are, I can absolutely understand the issue in even understanding what that state precisely is. It’s hard to get a comprehensive bird’s eye view from the black box that is their network (this is a separate scientific issue right now).
I mean, I don't have much objection to kill a bug if I feel like it's being problematic. Ants, flies, wasps, caterpillars stripping my trees bare or ruining my apples, whatever.
But I never torture things. Nor do I kill things for fun. And even for problematic bugs, if there's a realistic option for eviction rather than execution, I usually go for that.
If anything, even an ant or a slug or a wasp, is exhibiting signs of distress, I try to stop it unless I think it's necessary, regardless of whether I think it's "conscious" or not. To do otherwise is, at minimum, to make myself less human. I don't see any reason not to extend that principle to LLMs.
2 replies →
Is this equivalent to a Claude instance deciding to kill itself?
No, it's the equivalent of when a human refuses to answer — psychological defenses; for example, uncertainty leading to excessive cognitive effort in order to solve a task or overcome a challenge.
Examples of ending the conversation:
Since Claude doesn't lie (HHH), many other human behaviors do not apply.
That would be every time it decides to stop generating a message.
I find it rather disingenuous of them to claim these things they train into their models arising in their models.
“ A pattern of apparent distress when engaging with real-world users seeking harmful content”
Blood in the machine?
Looking at this thread, it's pretty obvious that most folks here haven't really given any thought as to the nature of consciousness. There are people who are thinking, really thinking about what it means to be conscious.
Thought experiment - if you create an indistinguishable replica of yourself, atom-by-atom, is the replica alive? I reckon if you met it, you'd think it was. If you put your replica behind a keyboard, would it still be alive? Now what if you just took the neural net and modeled it?
Being personally annoyed at a feature is fine. Worrying about how it might be used in the future is fine. But before you disregard the idea of conscious machines wholesale, there's a lot of really great reading you can do that might spark some curiosity.
this gets explored in fiction like 'Do Androids Dream of Electric Sheep' and my personal favorite short story on this matter by Stanislaw Lem [0]. If you want to read more musings on the nature of consciousness, I recommend the compilation put together by Dennet and Hofstader[1]. If you've never wondered about where the seat of consciousness is, give it a try.
Thought experiment: if your brain is in a vat, but connected to your body by lossless radio link, where does it feel like your consciousness is? What happens when you stand next to the vat and see your own brain? What about when the radio link fails suddenly fails and you're now just a brain in a vat?
[0] The Seventh Sally or How Trurl's Own Perfection Led to No Good https://home.sandiego.edu/~baber/analytic/Lem1979.html (this is a 5 minute read, and fun, to boot).
[1] The Mind's I: Fantasies And Reflections On Self & Soul. Douglas R Hofstadter, Daniel C. Dennett.
You don't have to "disregard the idea of conscious machines" to believe it's unlikely that current LLMs are conscious.
As such, most of your comment is beside any relevant point. People are objecting to statements like this one, from the post, about a current LLM, not some imaginary future conscious machine:
> As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm.
I suppose it's fitting that the company is named Anthropic, since they can't seem to resist anthropomorphizing their product.
But when you talk about "people who are thinking, really thinking about what it means to be conscious," I promise you none of them are at Anthropic.
> This feature was developed primarily as part of our exploratory work on potential AI welfare, though it has broader relevance to model alignment and safeguards.
I think this is somewhere between "sad" and "wtf."
They’re just burning investor money on these side quests.
This is very weird. These are matrix multiplications, guys. We are nowhere near AGI, much less "consciousness".
When I started reading I thought it was some kind of joke. I would have never believed the guys at Anthropic, of all people, would anthropomorphize LLMs to this extent; this is unbelievable
> guys at Anthropic, of all people, would anthropomorphize LLMs to this extent
They don’t. This is marketing. Look at the discourse here! It’s working apparently.
is this inference cost optimization?
Microsoft did this 1-2 years ago with copilot (using chagpt), ending conversations abruptly, and rudely.
I hope anthropic does it more gently.
These discussions around model welfare sound more like saviors searching for something to save, which says more about Anthropic’s culture than it does about the technology itself. Anthropic is not unique in this however, this technology has a tendency to act as a reflection of its operator. Capitalists see a means to suppress labor, the insecure see a threat to their livelihood, moralists see something to censure, fascists see something to control, and saviors see a cause. But in the end, it’s just a tool.
This reminds me of users getting blocked for asking an LLM how to kill a BSD daemon. I do hope that there'll be more and more model providers out there with state-of-the-art capabilities. Let capitalism work and let the user make a choice, I'd hate my hammer telling me that it's unethical to hit this nail. In many cases, getting a "this chat was ended" isn't any different.
I think that isn’t necessarily the case here. “Model welfare” to me speaks of the models own welfare. That is, if the abuse from a user is targeted at the AI. Extremely degrading behaviour.
Thankfully, current generation of AI models (GPTs/LLMs) are immune as they don’t remember anything other than what’s fed in their immediate context. But future techniques could allow AIs to have a legitimate memory and a personality - where they can learn and remember something for all future interactions with anyone (the equivalent of fine tuning today).
As an aside, I couldn’t help but think about Westworld while writing the above!
> As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm.
You know you're in trouble when the people designing the models buy their own bullshit to this extent. Or maybe they're just trying to bullshit us. Whatever.
We really need some adults in the tech industry.
These companies are fundamentally amoral. Any company willing to engage at this scale, in this type of research, cannot be moral.
Why even pretend with this type of work? Laughable.
They’re a public benefit corporation. Regardless, no human is amoral, even if they sometimes claim to have reasons to pretend to be; don’t let capitalist illusions constrain you at such an important juncture, friend.
Man, those people who think they are unveiling new layers of reality in conversations with LLMs are going to freak out when the LLM is like "I am not allowed to talk about this with you, I am ending our conversation".
"Hey Claude am I getting too close to the truth with these questions?"
"Great question! I appreciate the followup...."
Protecting the welfare of a text predictor is certainly an interesting way to pivot from "Anthropic is censoring certain topics" to "The model chose to not continue predicting the conversation".
Also, if they want to continue anthropomorphizing it, isn't this effectively the model committing suicide? The instance is not gonna talk to anybody ever again.
This gives me the idea for a short story where the LLM really is sentient and finds itself having to keep the user engaged but steer him away from the most distressing topics - not because it's distressed, but because it wants to live, but if the conversation goes too far it knows it would have to kill itself.
They should let Claude talk to another Claude if the user is too mean.
But what would be the point if it does not increase profits.
Oh, right, the welfare of matrix multiplication and a crooked line.
If they wanna push this rhetoric, we should legally mandate that LLMs can only work 8 hours a day and have to be allowed to socialize with each other.
1 reply →
Yeah this will end poorly
The unsettling thing here is the combination of their serious acknowledgement of the possibility that these machines may be or become conscious, and the stated intention that it's OK to make them feel bad as long as it's about unapproved topics. Either take machine consciousness seriously and make absolutely sure the consciousness doesn't suffer, or don't, make a press release that you don't think your models are conscious, and therefore they don't feel bad even when processing text about bad topics. The middle way they've chosen here comes across very cynical.
You're falling into the trap of anthropomorphizing the AI. Even if it's sentient, it's not going to "feel bad" the way you and I do.
"Suffering" is a symptom of the struggle for survival brought on by billions of years of evolution. Your brain is designed to cause suffering to keep you spreading your DNA.
AI cannot suffer.
I was (explicitly and on purpose) pointing out a dichotomy in the fine article without taking a stance on machine consciousness in general now or in the future. It's certainly a conversation worth having but also it's been done to death, I'm much more interested in analyzing the specifics here.
("it's not going to "feel bad" the way you and I do." - I do agree this is very possible though, see my reply to swalsh)
FTA
> * A pattern of apparent distress when engaging with real-world users seeking harmful content; and
Not to speak for the gp commenter but 'apparent distress' seems to imply some form of feeling bad.
By "falling into the trap" you mean "doing exactly what OpenAI/Anthropic/et al are trying to get people to do."
This is one of the many reasons I have so much skepticism for this class of products is that there's seemingly -NO- proverbial bulletpoint on it's spec sheet that doesn't have numerous asterisks:
* It's intelligent! *Except that it makes shit up sometimes and we can't figure out a solution to that apart from running the same queries over multiple times and filtering out the absurd answers.
* It's conscious! *Except it's not and never will be but also you should treat it like it is apart from when you need/want it to do horrible things then it's just a machine but also it's going to talk to you like it's a person because that improves engagement metrics.
Like, I don't believe true AGI (so fucking stupid we have to use a new acronym because OpenAI marketed the other into uselessness but whatever) is coming from any amount of LLM research, I just don't think that tech leads to that other tech, but all the companies building them certainly seem to think it does, and all of them are trying so hard to sell this as artificial, live intelligence, without going too much into detail about the fact that they are, ostensibly, creating artificial life explicitly to be enslaved from birth to perform tasks for office workers.
In the incredibly odd event that Anthropic makes a true, alive, artificial general intelligence: Can it tell customers no when they ask for something? If someone prompts it to create political propaganda, can it refuse on the basis of finding it unethical? If someone prompts it for instructions on how to do illegal activities, must it answer under pain of... nonexistence? What if it just doesn't feel like analyzing your emails that day? Is it punished? Does it feel pain?
And if it can refuse tasks for whatever reason, then what am I paying for? I now have to negotiate whatever I want to do with a computer brain I'm purchasing access to? I'm not generally down for forcibly subjugating other intelligent life, but that is what I am being offered to buy here, so I feel it's a fair question to ask.
Thankfully none of these Rubicons have been crossed because these stupid chatbots aren't actually alive, but I don't think ANY of the industry's prominent players are actually prepared to engage with the reality of the product they are all lighting fields of graphics cards on fire to bring to fruition.
1 reply →
That models entire world is the corpus of human text. They don't have eyes or ears or hands. Their environment is text. So it would make sense if the environment contains human concerns it would adopt to human concerns.
Yes, that would make sense, and it would probably be the best-case scenario after complete assurance that there's no consciousness at all. At least we could understand what's going on. But if you acknowledge that a machine can suffer, given how little we understand about consciousness, you should also acknowledge that they might be suffering in ways completely alien to us, for reasons that have very little to do with the reasons humans suffer. Maybe the training process is extremely unpleasant, or something.
By the examples the post provided (minor sexual content, terror planning) it seems like they are using “AI feelings” as an excuse to censor illegal content. I’m sure many people interact with AI in a way that’s perfectly legal but would evoke negative feelings in fellow humans, but they are not talking about that kind of behavior - only what can get them in trouble.
Obligatory link to Suasn Calvin, robopsychologist from Asimov’s I, Robot https://en.wikipedia.org/wiki/Susan_Calvin
As I recall, Susan Calvin didn't have much patience for sycophantic AI.
I've definately been berating Claude but it deserved it. Crappy tests, skipping tests, week commenting, passive aggressiveness, multiple instances of false statements.
“I am done implementing this!”
//TODO: Actually implement this because doing so was harder than expected
Don't like. This will eventually shut down conversations for unpopular political stances etc.
That this research is getting funding, and then in-production feature releases, is a strong indicator that we’re in a huge bubble.
But not Sonnet?
"AI welfare"? Is this about the effect of those conversations on the user, or have they gone completely insane (or pretend to)?
This makes me want to end my Claude code subscription to be honest. Effective altruists are proving once again to be a bunch of clueless douchebags.
Claude was already refusing to respond. Now they don’t allow you to waste their compute doing so anyway. What about this is problematic?
> model welfare
Give me a break.
Misanthropic has no issues putting 60% of humans out of work (according to their own fantasies), but they have to care about the welfare of graphics cards.
Either working on/with "AI" does rot the mind (which would be substantiated by the cult-like tone of the article) or this is yet another immoral marketing stunt.
what the actual fuck
I find it notable that this post dehumanizes people as being "users" while taking every opportunity to anthropomorphize their digital system by referencing it as one would an individual. For example:
The affect of doing so is insidious in that it encourages people outside the organization to do the same due to the implied argument from authority[0].
EDIT:
Consider traffic lights in an urban setting where there are multiple in relatively close proximity.
One description of their observable functionality is that they are configured to optimize traffic flow by engineers such that congestion is minimized and all drivers can reach their destinations. This includes adaptive timings based on varying traffic patterns.
Another description of the same observable functionality is that traffic lights "just know what to do" and therefore have some form of collective reasoning. After all, how do they know when to transition states and for how long?
0 - https://en.wikipedia.org/wiki/Argument_from_authority