Comment by 317070

11 hours ago

If RL was used to train the model, the model will have been trained on its own sampled sequences, and those will have been generated at a temperature of 1.0. They have to be: sampling at a lower temperature would make the policy's entropy collapse prematurely during training, and sampling at a higher one would make it explode.

After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.
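
To make the effect of temperature concrete, here is a minimal sketch in plain Python. The logits are made up for illustration and do not come from any real model; the helper names (`softmax`, `entropy`, `kl_divergence`) are my own. It measures the entropy of the sampling distribution at a few temperatures, and how far each one drifts (in KL divergence) from the temperature-1.0 distribution the model would have been trained on.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits, chosen only for illustration.
logits = [2.0, 1.0, 0.5, -1.0]

# Distribution at temperature 1.0, i.e. the one assumed to be used during RL.
reference = softmax(logits, temperature=1.0)

results = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    results[t] = (entropy(probs), kl_divergence(probs, reference))
    print(f"T={t}: entropy={results[t][0]:.3f} nats, "
          f"KL from T=1.0 = {results[t][1]:.3f} nats")
```

Entropy rises with temperature, and only T=1.0 has zero divergence from the reference. This is a single-token toy; over a long sampled sequence the per-token mismatch accumulates, which is one way to read "out-of-distribution" above.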

That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.