Comment by gr3ml1n

1 year ago

Your description of distillation is largely correct, but your description of RLHF is not.

The process of taking a base model that is capable of continuing ('autocompleting') some text input and teaching it to respond to questions in a Q&A chatbot-style format is called instruction tuning. It's pretty much always done via supervised fine-tuning, otherwise known as: show it a bunch of examples of chat transcripts.
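To make that concrete, here's a minimal sketch of one SFT step. The library choice (Hugging Face Transformers + PyTorch), the model name, and the toy transcript are placeholders, not anyone's actual training setup:

  # Supervised fine-tuning sketch: show the base model a chat transcript
  # and train it to predict the next token, including the assistant's reply.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "gpt2"  # stand-in for whatever base model you're instruction-tuning
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)
  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

  # One toy chat transcript formatted as plain text.
  transcript = "User: What's the capital of France?\nAssistant: Paris."
  inputs = tokenizer(transcript, return_tensors="pt")

  # Standard causal-LM objective: labels are the input tokens (shifted internally).
  loss = model(**inputs, labels=inputs["input_ids"]).loss
  loss.backward()
  optimizer.step()

Real pipelines usually mask out the loss on the user's turns and format transcripts with a chat template, but the core idea is just next-token prediction on example conversations.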

RLHF is more granular and generally one of the last steps in a training pipeline. With RLHF you train a new model, the reward model.

You make that model by having the LLM output a bunch of responses and then having humans rank those outputs. E.g.:

  Q: What's the capital of France? A: Paris

Might be scored as `1` by a human, while:

  Q: What's the capital of France? A: Fuck if I know

Would be scored as `0`.

You train the reward model on those rankings. Then you have the LLM generate a ton of responses and have the reward model score them.
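Sketching that reward-model training step: in practice the rankings are usually pairwise comparisons ('this answer beat that one'), and the model learns to give the preferred answer a higher score. The tiny scoring network below is a stand-in for a fine-tuned LLM with a scalar head, and the token ids are fake:

  # Reward model sketch: learn to score the human-preferred answer higher.
  import torch
  import torch.nn as nn

  class TinyRewardModel(nn.Module):
      # Real reward models are usually fine-tuned LLMs with a scalar output head;
      # this bag-of-embeddings scorer just stands in for one.
      def __init__(self, vocab_size=50257, dim=64):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, dim)
          self.score = nn.Linear(dim, 1)

      def forward(self, token_ids):
          return self.score(self.embed(token_ids).mean(dim=0))  # scalar reward

  reward_model = TinyRewardModel()
  optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

  # Pretend these are token ids for the two answers above.
  good = torch.randint(0, 50257, (12,))  # "A: Paris" (scored 1 by the human)
  bad = torch.randint(0, 50257, (15,))   # "A: Fuck if I know" (scored 0)

  # Pairwise (Bradley-Terry) loss: push the preferred answer's score above the other's.
  loss = -torch.nn.functional.logsigmoid(reward_model(good) - reward_model(bad)).mean()
  loss.backward()
  optimizer.step()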

If the reward model says it's good, the LLM's output is reinforced, i.e.: it's told 'that was good, more like that'.

If the output scores low, you do the opposite.
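The simplest version of that reinforce/penalize step looks roughly like this. It's a bare-bones REINFORCE-style sketch, not what production pipelines actually run (those use PPO or similar, plus a KL penalty against the original model so it doesn't drift), and the keyword-matching reward function is just a stand-in for the trained reward model:

  # RL step sketch: sample a response, score it, and nudge the LLM
  # toward (or away from) that response in proportion to the score.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder instruction-tuned model
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

  def reward_model(text):
      # Stand-in for the trained reward model: ~1 for good answers, ~0 for bad ones.
      return torch.tensor(1.0 if "Paris" in text else 0.0)

  prompt = tokenizer("Q: What's the capital of France? A:", return_tensors="pt")
  generated = model.generate(**prompt, max_new_tokens=8, do_sample=True)
  response_ids = generated[0, prompt["input_ids"].shape[1]:]
  reward = reward_model(tokenizer.decode(response_ids))

  # Log-probability the LLM assigned to its own sampled response.
  logits = model(generated).logits[0, prompt["input_ids"].shape[1] - 1 : -1]
  chosen = torch.log_softmax(logits, dim=-1).gather(1, response_ids.unsqueeze(1)).sum()

  # High reward -> make that response more likely; low reward -> less likely.
  loss = -(reward - 0.5) * chosen  # 0.5 as a crude baseline
  loss.backward()
  optimizer.step()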

Because the reward model is trained on human preferences, and it's used to reinforce the LLM's outputs according to those preferences, the whole process is called reinforcement learning from human feedback.