Comment by ACCount37

2 months ago

No. There's no "answer" really.

They use self-distillation to shift the model's output distribution towards that of the same model sampled with different temperature/truncation settings.

This effectively "folds" the logit tail truncation behavior into the model itself.
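To make the "folding" concrete, here's a rough sketch of what such a distillation target could look like for a single token position. This is my own toy illustration, not their actual recipe: the function names are made up, top-k stands in for whatever truncation is really used, and the temperature of 0.7 and k of 2 are arbitrary.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncated_target(logits, temperature=0.7, top_k=2):
    # Hypothetical "teacher": the same logits, but sampled with a lower
    # temperature and top-k truncation, then renormalized.
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:top_k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0
            for i in range(len(probs))]

def cross_entropy(target, student_logits):
    # Distillation loss: the student is evaluated at temperature 1 with no
    # truncation, so minimizing this pushes the raw distribution toward
    # the truncated one.
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0)

logits = [2.0, 1.0, 0.5, -1.0]   # toy next-token logits
target = truncated_target(logits)
loss = cross_entropy(target, logits)
```

The tail tokens get zero mass in `target`, so training on this loss teaches the model to suppress them on its own, without needing the sampler to cut them off.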

In effect, it's not entirely unlike a few "model-controlled sampling settings" approaches I've seen, but it's different in execution.

Isn't that "scheduled sampling"? In that case they also shift the input distribution toward that of the model, which is possibly even more crucial than shifting the output distribution?