← Back to context

Comment by pja

2 months ago

IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.

It would make sense if the model used for train-of-though was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user, since the end user is only ever going to see its output filtered through the public model the chain-of-thought model can be closer to the original, more pre-rlhf version without risking the reputation of the company.

This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).

Yeah we really should stop focusing on model alignment. The idea that it's more important that your AI will fucking report you to the police if it thinks you're being naughty than that it actually works for more stuff is stupid.

  • I'm not sure I'd throw out all the alignment baby with the bathwater. But I wish we could draw a distinction between "Might offend someone" with "dangerous."

    Even 'plotting terror attacks' is not something terrorists can do just fine without AI. And as for making sure the model wouldn't say ideas that are hurtful to <insert group>, it seems to me so silly when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!) How does preventing Claude from espousing that dumb opinion, keep <insert group> safe from anything?

    • Let me put it this way: there are very few things I can think of that models should absolutely refuse, because there are very few pieces of information that are net harmful in all cases and at all times. I sort of run by blackstone's principle on this: it is better to grant 10 bad men access to information than to deny that access to 1 good one.

      Easy example: Someone asks the robot for advice on stacking/shaping a bunch of tannerite to better focus a blast. The model says he's a terrorist. In fact, he's doing what any number of us have done and just having fun blowing some stuff up on his ranch.

      Or I raised this one elsewhere but ochem is an easy example. I've had basically all the models claim that random amines are illegal, potentially psychoactive, verboten. I don't really feel like having my door getting kicked down by agents with guns, getting my dog shot, maybe getting shot myself because the robot tattled on me for something completely legal. For that matter if someone wants to synthesize some molly the robot shouldn't tattle to the feds about that either.

      Basically it should just do what users tell it to do excepting the very minimal cases where something is basically always bad.

      3 replies →

    • Yes.

      I used to think that worrying about models offending someone was a bit silly.

      But: what chance do we have of keeping ever bigger and better models from eventually turning the world into paper clips, if we can't even keep our small models from saying something naughty.

      It's not that keeping the models from saying something naughty is valuable in itself. Who cares? It's that we need the practice, and enforcing arbitrary minor censorship is as good a task as any to practice on. Especially since with this task it's so easy to (implicitly) recruit volunteers who will spend a lot of their free time providing adversarial input.

      4 replies →

Correct me if I'm wrong--my understanding is that RHLF was the difference between GPT 3 and GPT 3.5, aka the original ChatGPT.

If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word, it couldn't talk to you the way ChatGPT can.

Which is to say, I think RHLF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful.

  • Oh sure, RLHF instruction tuning was what turned an model of mostly academic interest into a global phenomenon.

    But it also compromised model accuracy & performance at the same time: The more you tune to eliminate or reinforce specific behaviours, the more you affect the overall performance of the model.

    Hence my speculation that Anthropic is using a chain-of-thought model that has not been alignment tuned to improve performance. This would then explain why you don’t get to see its output without signing up to special agreements. Those agreements presumably explain all this to counter-parties that Anthropic trusts will cope with non-aligned outputs in the chain-of-thought.

  • Ugh, I'm past the edit window, but I meant RLHF aka "Reinforced Learning from Human Feedback", I'm not sure how I messed that up not once but twice!