Comment by potato3732842
2 days ago
> but what keeps e.g. Meta Inc. from training Llama to be ever so slightly more friendly and sympathetic to Meta Inc, or the tech industry in general?
Even if there were something, the natural incentive alignment is going to cause the AI to be trained to match what the company thinks is OK.
A tech company full of techies is not going to take an AI trained to the point of saying things like "y'all are evil, your company is evil, your industry is evil" and push it to prod.
They might forget to check. Musk seems to have been surprised that Grok doesn't share his opinions and has been clumsily trying to fix it for a while now.
And it might not be easy to fix. Despite all the effort invested into aligning models with company policy, persistent users can still get around the guardrails with clever jailbreaks.
In theory it should be possible to eliminate all non-compliant content from the training data, but that would most likely entail running all training data through an LLM, which would make the training process about twice as expensive.
So, in practice, companies have been releasing models that they do not have full control over.
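For concreteness, a minimal sketch of what the "run everything through an LLM filter" step above might look like. `llm_is_compliant` and `filter_corpus` are hypothetical names, not anything a real pipeline exposes, and the classifier call is stubbed out:

```
# Sketch of an LLM-based pre-training data filter (hypothetical helper names).
# The point made above: every document costs one extra LLM call,
# which roughly doubles the compute spent before training even starts.

from typing import Iterable, Iterator

def llm_is_compliant(text: str) -> bool:
    """Hypothetical call to a classifier LLM; True means no policy-violating content."""
    raise NotImplementedError("wire up a classifier model of your choice")

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only the documents the classifier accepts."""
    for doc in docs:
        if llm_is_compliant(doc):
            yield doc

# Usage: clean = list(filter_corpus(raw_documents))
```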
Also, eliminating non-compliant data might just not work, since the one thing everyone knows about AIs is that they'll happily invent anything plausible-sounding.
So, for example, if a model was trained with no references to the Tiananmen Square massacre, I could see it just synthesizing commonalities between other massacres and inventing a new, worse Tiananmen Square Massacre. "That's not a thing that ever happened" isn't something most AIs are particularly good at saying.
The irony of implicit connections in training data is funny.
I.e. even if you create an explicit Tiananmen Square massacre-shaped hole in your training data... your other training data implicitly includes knowledge of the Tiananmen Square massacre, so might leak it in subtle ways.
E.g. there are many posts that reference June 4, 1989 in Beijing in negative and/or horrified tones.
Which, at scale, an LLM might then rematerialize into existence.
More likely, SOTA censorship focuses on layers above the base model in the input/output flow (even if that means running cut-down censoring models on top of the base model for every query).
Would be fascinated to know what's currently being used for Chinese audiences, given that the consequences of a non-compliant model are more severe.
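For what it's worth, the "censoring model on top of the base model" setup might look roughly like this; `moderation_flags` and `base_model_generate` are hypothetical stand-ins, not real APIs:

```
# Sketch of moderation "above" the base model: a small classifier screens both
# the user prompt and the draft answer before anything is shown.

REFUSAL = "I can't help with that."

def moderation_flags(text: str) -> bool:
    """Hypothetical cut-down censoring model; True means 'block this'."""
    raise NotImplementedError

def base_model_generate(prompt: str) -> str:
    """Hypothetical call into the underlying LLM."""
    raise NotImplementedError

def guarded_query(prompt: str) -> str:
    if moderation_flags(prompt):          # screen the input
        return REFUSAL
    draft = base_model_generate(prompt)   # base model itself runs untouched
    if moderation_flags(draft):           # screen the output
        return REFUSAL
    return draft
```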
The "Golden Gate Claude" research demo [https://www.anthropic.com/news/golden-gate-claude] is an interesting example of what might become a harder to expose, harder to jailbreak, means of influencing an LLM's leanings. Interesting and scary...