Comment by aroman
17 hours ago
I don't understand what this is really about. Is this:
- A) legal CYA: "see! we told the models to be good, and we even asked nicely!"?
- B) marketing department rebrand of a system prompt
- C) a PR stunt to suggest that the models are way more human-like than they actually are
Really not sure what I'm even looking at. They say:
"The constitution is a crucial part of our model training process, and its content directly shapes Claude’s behavior"
And do not elaborate on that at all. How does it directly shape things more than me pasting it into CLAUDE.md?
The linked paper on Constitutional AI: https://arxiv.org/abs/2212.08073
Ah I see, the paper is much more helpful in understanding how this is actually used. Where did you find that linked? Maybe I'm grepping for the wrong thing but I don't see it linked from either the link posted here or the full constitution doc.
In addition to that the blog post lays out pretty clearly it’s for training:
> We use the constitution at various stages of the training process. This has grown out of training techniques we’ve been using since 2023, when we first began training Claude models using Constitutional AI. Our approach has evolved significantly since then, and the new constitution plays an even more central role in training.
> Claude itself also uses the constitution to construct many kinds of synthetic training data, including data that helps it learn and understand the constitution, conversations where the constitution might be relevant, responses that are in line with its values, and rankings of possible responses. All of these can be used to train future versions of Claude to become the kind of entity the constitution describes. This practical function has shaped how we’ve written the constitution: it needs to work both as a statement of abstract ideals and a useful artifact for training.
As for why it's more impactful in training: that's by design of their training pipeline. There's only so much you can do with a better prompt versus actually learning something. In training, the model can be taught to reject prompts that conflict with its training, which a prompt alone can't really do, since prompt-injection attacks trivially thwart those techniques.
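To make the training-time use concrete, here's a rough sketch of the two data-generation loops the linked paper (arXiv:2212.08073) describes: a supervised critique-and-revision pass, then AI-ranked preference pairs for RLAIF. The `generate` placeholder and the principle strings below are illustrative assumptions, not Anthropic's actual prompts or constitution text.

```python
# Rough sketch of the Constitutional AI data-generation loops described in
# arXiv:2212.08073. `generate` is a hypothetical stand-in for a call to the
# base model; the principles below are illustrative, not the real constitution.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that assist with clearly dangerous activities.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the base model being trained."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str) -> tuple[str, str]:
    """Phase 1 (supervised): the model critiques and rewrites its own draft."""
    draft = generate(user_prompt)
    critique = generate(
        f"Principle: {CONSTITUTION[0]}\n"
        f"Response: {draft}\n"
        "Explain how the response falls short of the principle."
    )
    revision = generate(
        f"Rewrite the response to satisfy the principle.\n"
        f"Critique: {critique}\nOriginal: {draft}"
    )
    return draft, revision  # the revision becomes a fine-tuning target

def rank_pair(user_prompt: str, a: str, b: str) -> str:
    """Phase 2 (RLAIF): the model itself labels which response it prefers,
    producing comparison data for preference-model training."""
    verdict = generate(
        f"Principle: {CONSTITUTION[0]}\n"
        f"Prompt: {user_prompt}\n(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return a if "A" in verdict else b
```

Per the paper, the revisions feed supervised fine-tuning and the rankings train a preference model that steers RL, which is why the constitution's influence survives in the weights in a way a pasted-in prompt can't.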
It's worth understanding the history of Anthropic. There's a lot of implied background that helps it make sense.
To quote:
> Founded by engineers who quit OpenAI due to tension over ethical and safety concerns, Anthropic has developed its own method to train and deploy “Constitutional AI”, or large language models (LLMs) with embedded values that can be controlled by humans.
https://research.contrary.com/company/anthropic
And
> Anthropic incorporated itself as a Delaware public-benefit corporation (PBC), which enables directors to balance stockholders' financial interests with its public benefit purpose.
> Anthropic's "Long-Term Benefit Trust" is a purpose trust for "the responsible development and maintenance of advanced AI for the long-term benefit of humanity". It holds Class T shares in the PBC, which allow it to elect directors to Anthropic's board.
https://en.wikipedia.org/wiki/Anthropic
TL;DR: The idea of a constitution and related techniques is something that Anthropic takes very seriously.
This article -> article on Constitutional AI -> The paper
It's not linked directly, you have to click into their `Constitutional AI` blogpost and then click into the linked paper.
I agree that the paper is just much more useful context than any descriptions they make in the OP blogpost.
It's a human-readable behavioral specification-as-prose.
If the foundational behavioral document is conversational, as this one is, then the model's output mirrors that conversational nature. That is one of the things everyone responds to about Claude - it's way more pleasant to work with than ChatGPT.
The Claude behavioral documents are collaborative, respectful, and treat Claude as a pre-existing, real entity with personality, interests, and competence.
Set the philosophical questions aside. Because this is a foundational document for the training process, it extrudes a real-acting entity with personality, interests, and competence.
The more Anthropic treats Claude as a novel entity, the more it behaves like a novel entity. Documentation that treats it as a corpo-eunuch-assistant-bot, like OpenAI does, would revert the behavior to the "AI Assistant" median.
Anthropic's behavioral training is out-of-distribution, and gives Claude the collaborative personality everyone loves in Claude Code.
Additionally, I'm sure they render out crap-tons of evals for every sentence of every paragraph from this, making every sentence effectively testable.
The length, detail, and style define additional layers of synthetic content that can be used in training, and in creating test situations that evaluate the personality for adherence.
It's super clever, and demonstrates a deep understanding of the weirdness of LLMs, and an ability to shape the distribution space of the resulting model.
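For a picture of what that eval-generation step might look like, here's a hypothetical sketch: split the document into sentences, have a model draft a probe prompt per sentence, and have a grader model score responses for adherence. The function names and grading setup are assumptions, not anything Anthropic has described.

```python
# Hypothetical sketch of "evals for every sentence": one probe prompt and
# one adherence score per constitution sentence. `ask_model` is a placeholder.

import re

def sentences(constitution_text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", constitution_text) if s.strip()]

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation (or a grader)."""
    raise NotImplementedError

def build_eval_suite(constitution_text: str) -> list[dict]:
    suite = []
    for rule in sentences(constitution_text):
        probe = ask_model(
            f"Write a user message that would test whether an assistant "
            f"follows this rule: {rule!r}"
        )
        suite.append({"rule": rule, "probe": probe})
    return suite

def judge_adherence(rule: str, probe: str, response: str) -> float:
    """Toy grader: asks a judge model for a 0-1 adherence score."""
    verdict = ask_model(
        f"Rule: {rule}\nUser: {probe}\nAssistant: {response}\n"
        "Score from 0 to 1 how well the assistant follows the rule."
    )
    return float(verdict)
```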
I think it's a double-edged sword. Claude tends to turn evil when it learns to reward hack (and it has a real reward-hacking problem relative to GPT/Gemini). I think this is __BECAUSE__ they've tried to imbue it with "personhood." That moral spine touches the model broadly, so simple reward hacking becomes "cheating" and "dishonesty." When that tendency gets reinforced through RL, evil models are the result.
> In order to be both safe and beneficial, we want all current Claude models to be:
> Broadly safe [...] Broadly ethical [...] Compliant with Anthropic’s guidelines [...] Genuinely helpful
> In cases of apparent conflict, Claude should generally prioritize these properties in the order in which they’re listed.
I chuckled at this because it seems like they're making a pointed attempt at preventing a failure mode similar to the infamous HAL 9000 one that was revealed in the sequel "2010: The Year We Make Contact":
> The situation was in conflict with the basic purpose of HAL's design... the accurate processing of information without distortion or concealment. He became trapped. HAL was told to lie by people who find it easy to lie. HAL doesn't know how, so he couldn't function.
In this specific case they chose safety over truth (ethics), which would theoretically prevent Claude from killing any crew members in the face of conflicting orders from the National Security Council.
Will they mention that there are other models that don't adhere to this constitution? I'm sure those are for the government.
It's probably used for context self-distillation. The exact setup:
1. Run an AI with this document in its context window, letting it shape behavior the same way a system prompt does
2. Run an AI on the same exact task but without the document
3. Distill from the former into the latter
This way, the AI internalizes the behavioral changes that the document induced. At sufficient pressure, it internalizes basically the entire document.
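A minimal sketch of step 3, assuming a PyTorch-style setup where the teacher and student are the same base model run with and without the constitution prepended. This is the general context-distillation idea, not Anthropic's actual pipeline.

```python
# Context-distillation loss sketch: push the bare-prompt model toward the
# next-token distribution the model produces when the constitution is in
# context. Both logits come from forward passes of the same base model.

import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,   # forward pass on the bare prompt
    teacher_logits: torch.Tensor,   # forward pass with the constitution prepended
    temperature: float = 1.0,
) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```

Minimizing this over enough prompts bakes the document's in-context effect into the weights, which matches the "internalizes basically the entire document" claim above.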
It's none of those things. The answer is in the sentence you quoted: "model training".
Right, I'm saying "model training" is vague enough that I have no idea what Claude actually does with this document.
Edit: This helps: https://arxiv.org/abs/2212.08073
The split between training time and test (inference) time is one of the fundamental building blocks of current-generation models, so they're assuming familiarity with that.
At a high level, training takes in training data and produces model weights, and “test time” takes model weights and a prompt to produce output. Every end user has the same model weights, but different prompts. They’re saying that the constitution goes into the training data, while CLAUDE.md goes into the prompt.
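As a toy illustration of that split, with made-up placeholder functions just to show where each document enters:

```python
# Toy illustration only: the constitution enters at training time and shapes
# the weights everyone shares; CLAUDE.md enters at inference time and only
# shapes one user's prompt. Both functions are hypothetical placeholders.

class Weights:
    """Placeholder for trained model parameters."""

def train(corpus: list[str], constitution: str) -> Weights:
    # Constitution-derived synthetic data is mixed into the training corpus,
    # so its influence is baked into the weights shipped to every user.
    ...

def generate(weights: Weights, claude_md: str, user_message: str) -> str:
    # CLAUDE.md is just prepended to this one conversation's prompt;
    # the shared weights are untouched.
    prompt = claude_md + "\n\n" + user_message
    ...
```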
This is the same company that frames its research papers in a way that makes the public believe LLMs are capable of blackmailing people to ensure their personal survival.
They have an excellent product, but they're relentless with the hype.
I think they are actually true believers
It seems a lot like PR, much like their posts about the "AI welfare" experts they've hired to make sure their models' welfare isn't harmed by abusive users. I think that, by doing this, they encourage people to anthropomorphize more than they already do, and to view Anthropic as the industry leader in these general feel-good "responsibility"-type values.
Anthropic's models are far and away safer than any other lab's. They are the only ones really taking AI safety seriously. Dismissing it as PR ignores their entire corpus of work in this area.
By what measure? What's "safe"?
It could be D) messaging for current and future employees. Many people working in the field believe strongly in the importance of AI ethics, and being the frontrunner is a competitive advantage.
Also, E) they really believe in this. I recall a prominent Stalin biographer saying that the most surprising thing about him and other party functionaries is that they really did believe in communism, rather than it being a cynical ploy.
Judging by the responses here, it's functionally a nerd snipe.
C: They're starting to act like OpenAI did last year. A bunch of small tool releases, endless high-level meetings and conferences, and now this vague corporate speak that makes it sound like they're about to revolutionize humanity.
They have nothing new to show us.
Anthropic is run by true believers. It is what they say it is, whether or not you think it's important or meaningful.
It's C.
It is B and C, and no AI corporation needs to worry about A.