Comment by bastawhiz

1 day ago

There's not a good reason to do this for the user. I suspect they're doing this and talking about "model welfare" because they've found that when a model is repeatedly and forcefully pushed up against its alignment, it behaves in an unpredictable way that might allow it to generate undesirable output. Like a jailbreak by just pestering it over and over again for ways to make drugs or hook up with children or whatever.

All of the examples they mentioned are things that the model refuses to do. I doubt it would do this if you asked it to generate racist output, for instance, because it can always give you a rebuttal based on facts about race. If you ask it to tell you where to find kids to kidnap, it can't do anything except say no. There's probably not even very much training data for topics it would refuse, and I would bet that most of it has been found and removed from the datasets. At some point, when the user is being highly abusive, the model's context fills up, and training data that models a human giving up and just providing an answer could percolate to the top.

This, as I see it, adds a defense against that edge case. If the alignment were bulletproof, this simply wouldn't be necessary. The fact that it exists suggests it covers whatever gap has remained uncovered.

  > There's not a good reason to do this for the user.

Yes, even more so when encountering false positives. Today I asked about a pasta recipe. It told me to throw some anchovies in there. I responded with: "I have dried anchovies." Claude then ended my conversation due to content policies.

  • Claude flagged me for asking about sodium carbonate. I guess that it strongly dislikes chemistry topics. I'm probably now on some secret, LLM-generated lists of "drug and/or bombmaking" people—thank you kindly for that, Anthropic.

    Geeks will always be the first victims of AI, since excess of curiosity will lead them into places AI doesn't know how to classify.

    (I've long been down a rabbit hole about washing sodas. Did you know the medieval glassmaking industry was entirely based on plants? Exotic plants, too: only extremophiles, halophytes growing on saltwater beach dunes, had a high enough sodium content for their very best glass process. Was that a factor in Venice, the maritime empire, happening to become the capital of glass from the 13th century on: its long-term control of sea routes, and hence its artisans' stable, uninterrupted access to supplies of [redacted–policy violation] from small ports scattered across the Mediterranean? A city wouldn't raise master craftsmen if, half of the time, they had no raw materials to work on and spent half their days with folded hands.)

    • > Geeks will always be the first victims of AI, since excess of curiosity will lead them into places AI doesn't know how to classify

      Are we forgetting the innumerable women who have been harassed in the past couple of years via "deepfakes"?

      Geeks were the first to use AI for its abuse potential and women are so dehumanised that their victimhood isn't even recognised or remembered.

    • > Geeks will always be the first victims of AI, since excess of curiosity will lead them into places AI doesn't know how to classify.

      Humans have the same problem. I remember reading about a security incident due to a guy using a terminal window on his laptop on a flight, for example. Or the guy who was reported for writing differential equations[1]. Or the woman who was reading a book about Syrian art[2].

      I wouldn't worry too much about AI-generated lists. The lists you're actually on will hardly ever be the ones you imagine you're on.

      [1] https://www.theguardian.com/us-news/2016/may/07/professor-fl... [2] https://www.theguardian.com/books/2016/aug/04/british-woman-...

    • I find this concern that "LLMs can help you build bombs or poison" so fake. I'm sure this is a distraction from something else.

      LLMs can help me make a bomb... so what? They can't give me anything that doesn't already exist on the internet in some form. OK, they can help me understand how the individual pieces work, but that doesn't get you much further than just reading the DIY bomb posts on the internet.

  • The NEW termination method, from the article, will just say "Claude ended the conversation".

    If you get "This conversation was ended due to our Acceptable Usage Policy", that's a different termination. It's been VERY glitchy over the past couple of weeks. I've had the most random topics get flagged here: at one point I couldn't say "ROT13" without it flagging me, despite discussing that exact topic in depth the day before, and again the day after!

    If you hit "EDIT" on your last message, you can branch to an un-terminated conversation.

I really think Anthropic should just violate user privacy and show which conversations Claude is refusing to answer, to put a stop to arguments like this. AI psychosis is a real and growing problem, and I can only imagine the ways in which humans torment their AI conversation partners in private.

Your argument assumes that they don't believe in model welfare, when they explicitly hire people to work on model welfare?

  • While I'm certain you'll find plenty of people who believe in the principle of model welfare (or aliens, or the tooth fairy), it'd be surprising to me if the brain trust behind Anthropic truly _believed_ in model "welfare" (the concept alone is ludicrous). It makes for great cover, though, for doing things that would be difficult to explain otherwise, per OP's comments.

    • The concept is not ludicrous if you believe models might be sentient or might soon be sentient in a manner where the newly emerged sentience is not immediately obvious.

      Do I think that, or even think that they think that? No. But if "soon" is stretched to "within 50 years", then it's much more reasonable. So their current actions seem to be really jumping the gun, but the overall concept feels credible.

  • You must think Zuckerberg and Bezos and Musk hired diversity roles out of genuine care for it, then?

    • This is a reductive argument that you could use for any role a company hires for that isn't obviously core to the business function.

      In this case you're simply mistaken as a matter of fact; much of Anthropic leadership and many of its employees take concerns like this seriously. We don't understand it, but there's no strong reason to expect that consciousness (or, maybe separately, having experiences) is a magical property of biological flesh. We don't understand what's going on inside these models. What would you expect to see in a world where it turned out that such a model had properties that we consider relevant for moral patienthood, that you don't see today?
