Comment by nok22kon

11 hours ago

it's a bad idea in general to use a temperature other than 1.0. there is a reason labs strongly recommend 1.0.

a low temperature makes the output more deterministic, but the cost is that the model becomes "dumber".
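A minimal sketch of the determinism half of this claim, assuming the usual convention that sampling probabilities are proportional to exp(logit / temperature). The logits and the function names `softmax_with_temperature` and `entropy_nats` are made up for illustration. It only shows that lower temperatures concentrate probability on the top token; it says nothing about whether the model gets "dumber".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Probabilities proportional to exp(logit / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits, not taken from any real model.
logits = [2.0, 1.0, 0.5, 0.0]

for t in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token prob={max(probs):.3f}, entropy={entropy_nats(probs):.3f} nats")
```

As the temperature drops, the top-token probability rises and the entropy falls, which is what "more deterministic" means here.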

10 comments

tipsytoad 11 hours ago

1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default

programjames 4 hours ago

1.0 is "natural units". If your energy corresponds to nats, you should be using temperature 1.0. If your energy corresponds to bits, you should be using temperature ln(2) ~= 0.7. The optimization pressure is

max nats = max entropy + energy / temperature

Why might energy correspond to bits or nats? Imagine your goal is to play as many interesting games of chess as possible in a tournament. This implies you have to keep winning. If you look at the RL environment from the right perspective, you can turn it into optimizing bits or nats.

317070 9 hours ago

If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be; otherwise the entropy would collapse prematurely (if the temperature was lower) or explode (if it was higher).
After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.
That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
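The collapse-or-explosion argument can be illustrated with a toy model. This is a sketch under strong assumptions, not a description of any real RL pipeline: it ignores rewards entirely and assumes each training round fits the model perfectly to its own samples, so a round at temperature T simply divides the logits by T. The logits and the names `self_train_round` and `entropy_after_rounds` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def self_train_round(logits, temperature):
    # Idealized assumption: the model perfectly fits samples drawn from
    # softmax(logits / temperature), so its new logits are logits / temperature.
    return [x / temperature for x in logits]

def entropy_after_rounds(logits, temperature, rounds):
    for _ in range(rounds):
        logits = self_train_round(logits, temperature)
    return entropy_nats(softmax(logits))

start = [2.0, 1.0, 0.5, 0.0]  # hypothetical logits
print("initial entropy:", round(entropy_nats(softmax(start)), 3))
for t in (0.8, 1.0, 1.2):
    print(f"T={t}: entropy after 10 rounds =", round(entropy_after_rounds(start, t, 10), 3))
```

In this toy setting only T = 1.0 leaves the entropy unchanged. Sampling below 1.0 drives it toward zero and sampling above 1.0 drives it toward the uniform maximum of ln(4) nats.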

zipy124 10 hours ago

It really depends on the application, does it not? I'm not an LLM guy, but for creative tasks like storytelling, wouldn't you usually want a higher temperature? Happy to gain insight from anyone with experience here :)

embedding-shape 11 hours ago

It depends heavily on the model architecture and the implementation, though. I don't think you can say which values are better than others without first specifying those; otherwise it's straight-up guessing, ironically.

nullc 9 hours ago

If you use a model in a configuration far from where it was RLed, you get no warranty. (You also get no warranty the other way, however.)

jldugger 4 hours ago

Would 1.0 have fixed the wide variance in scoring?

nok22kon 1 hour ago

temperature is the wrong tool.
the variance is caused by the bad evaluation prompt.
if you ask "what is the capital of France" you'll always get Paris, with any (non-extreme) temperature.
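A rough sketch of why a confident answer barely moves with temperature while an ambiguous one does. Both logit vectors are invented for illustration, and `top1_probability` is a hypothetical helper, assuming sampling probabilities proportional to exp(logit / temperature).

```python
import math

def top1_probability(logits, temperature):
    """Probability of the highest-scoring token at the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    return max(exps) / sum(exps)

# Hypothetical logits: one answer dominates vs. several plausible answers.
confident = [12.0, 2.0, 1.0, 0.0]
ambiguous = [1.2, 1.0, 0.9, 0.8]

for t in (0.5, 1.0, 1.5):
    print(
        f"T={t}: confident top-1={top1_probability(confident, t):.4f}, "
        f"ambiguous top-1={top1_probability(ambiguous, t):.4f}"
    )
```

When the prompt leaves a large logit gap, the top answer stays above 99% across this range. When the prompt leaves the candidates close together, no temperature in this range makes the top answer a majority outcome, which is consistent with blaming the prompt for the variance.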

codeflo 11 hours ago

It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.

vidarh 10 hours ago

Plenty of setups default to lower values than 1.0.