Comment by killerstorm
7 hours ago
I'm sorry, but you got the terminology exactly backwards. Training on the answer is called supervised fine-tuning.
Just for the sake of clarity:
0. Full distillation uses the logits of the teacher model, which carry much more information than the text itself. This is the kind of distillation used inside labs, but one can't distill Claude this way because logits are not available via the API.
1. Supervised fine-tuning on synthetic data might be called blackbox distillation. I guess that's what you meant in your case (1).
2. Reinforcement learning (like RLAIF) uses the least amount of information from the teacher, i.e. only a few bits per task.
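To make the difference between cases 0 and 1 concrete, here is a minimal Python sketch of the two training signals at a single token position. The four-entry vocabulary, the logit values, and the function names are all made up for illustration; real setups run over the full vocabulary and often add a temperature. Case 2 is not shown.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits):
    # Case 0: KL(teacher || student) over the whole vocabulary.
    # The student sees the teacher's full probability distribution.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sft_loss(teacher_token_id, student_logits):
    # Case 1: cross-entropy against the single token the teacher emitted.
    # The student sees only which token was produced, not how likely
    # the alternatives were.
    q = softmax(student_logits)
    return -math.log(q[teacher_token_id])

# Hypothetical values for one token position (assumption, not real data).
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [0.5, 0.5, 0.5, 0.5]   # a student with no preference yet
sampled_token = 0                        # the token that appeared in the text

kl = logit_distillation_loss(teacher_logits, student_logits)
ce = sft_loss(sampled_token, student_logits)
print(f"logit distillation loss (KL): {kl:.4f}")
print(f"SFT loss (cross-entropy):     {ce:.4f}")
```

The KL target depends on every entry of `teacher_logits`, while the SFT target depends only on `sampled_token`, which is all an API that returns text exposes.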