Comment by killerstorm

8 hours ago

I'm sorry, but you got the terminology exactly backwards. Training on the answer is called supervised fine-tuning.

Just for the sake of clarity:

0. Full distillation uses the logits of the teacher model - that's much more information than the text itself. This is the kind of distillation used inside labs, but one can't distill Claude this way, as logits are not available via the API.

1. Supervised fine-tuning on synthetic data might be called blackbox distillation. I guess that's what you meant in your case (1).

2. Reinforcement learning (like RLAIF) uses the least amount of information from the teacher, i.e. only a few bits per task.
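
To make the difference between (0) and (1) concrete, here's a rough sketch in plain Python. Everything in it is made up for illustration (a 4-token vocabulary, toy logits, hypothetical function names); it isn't any lab's actual training code. The logit-based loss needs the teacher's whole distribution, while the SFT loss only needs the one token the teacher emitted as text.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits):
    # (0) KL(teacher || student) over the full vocabulary.
    # Requires the teacher's logits, which a text-only API does not expose.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sft_loss(sampled_token, student_logits):
    # (1) Cross-entropy on the single token the teacher actually produced.
    # Requires only the teacher's text output.
    q = softmax(student_logits)
    return -math.log(q[sampled_token])

# Toy example: hypothetical 4-token vocabulary, untrained (uniform) student.
teacher_logits = [2.0, 1.5, 0.1, -1.0]
student_logits = [0.5, 0.5, 0.5, 0.5]
sampled_token = 0  # the token you'd see in the API response

kd = logit_distillation_loss(teacher_logits, student_logits)
sft = sft_loss(sampled_token, student_logits)

# Upper bound on what one sampled token can tell you: log2(vocab size) bits.
# A single yes/no preference signal, as in (2), carries at most 1 bit.
max_bits_per_token = math.log2(len(teacher_logits))

print(f"logit distillation loss: {kd:.4f}")
print(f"SFT loss on sampled token: {sft:.4f}")
print(f"max bits per sampled token: {max_bits_per_token}")
```

Note that in this toy case the teacher considers token 1 almost as likely as token 0, and only the logit-based loss gets to see that; the SFT loss sees token 0 and nothing else.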
