Comment by wodenokoto
17 days ago
Temperature changes the distribution that is sampled, not whether a distribution is sampled at all.
Temperature changes the softmax equation[1], not whether or not you sample from the softmax result versus choosing the highest-probability token. IBM's documentation corroborates this, saying you need to set do_sample to True for the temperature to have any effect, i.e., T changes how we sample, not if we sample [2].
A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].
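To make that concrete, here is a minimal numpy sketch (my own illustration, not taken from the linked articles): temperature only rescales the logits before the softmax, so it reshapes the distribution; whether you then sample from it or take the argmax is a separate decision.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T (assumes T > 0)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))   # baseline distribution
print(softmax_with_temperature(logits, T=0.2))   # sharper: mass piles onto the argmax
print(softmax_with_temperature(logits, T=2.0))   # flatter: mass spreads out
```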
[1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html
[2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...
[3] https://community.openai.com/t/clarifications-on-setting-tem...
I have dealt with traditional ML models in the past and things like TensorFlow non-reproducibility, and managed to make them behave reproducibly. This is a very basic requirement. If we cannot even have that, or if people who deal with Gemini or similar models do not even know why they don't deliver reproducible results ... this seems very bad. It becomes outright unusable for anyone wanting to do research with reliable results. We already have a reproducibility crisis, because researchers often do not have the required knowledge to properly handle their tooling and would need a knowledgeable engineer to set it up. Only most engineers don't know either, and don't pay enough attention to detail to make reproducible software.
Your response is correct. However, you can choose not to sample from the distribution. You can have a rule to always choose the token with the highest probability produced by the softmax layer.
This approach should make the LLM deterministic regardless of the temperature chosen.
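Roughly, the difference looks like this (a hypothetical pick_token helper, not any library's actual API; numpy stands in for whatever the inference stack does):

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_token(probs, greedy=True):
    """Greedy decoding ignores the shape of the distribution and always takes
    the argmax; sampling draws a token according to the probabilities."""
    probs = np.asarray(probs)
    if greedy:
        return int(np.argmax(probs))              # same token every time
    return int(rng.choice(len(probs), p=probs))   # varies run to run

probs = [0.7, 0.2, 0.1]
print([pick_token(probs, greedy=True) for _ in range(5)])   # [0, 0, 0, 0, 0]
print([pick_token(probs, greedy=False) for _ in range(5)])  # e.g. [0, 1, 0, 0, 2]
```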
P.S. Choosing lower and lower temperatures will make the LLM more deterministic, but it will never be totally deterministic because there will always be some probability left on other tokens. Also, it is not possible to set the temperature to exactly 0, because the exp(z/T) term in the softmax blows up as T approaches 0. Like I mentioned above, you can avoid fiddling with temperature and just decide to always choose the token with the highest probability for full determinism.
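A quick self-contained sketch of that blowup (again just an illustration, not anyone's production code):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])
T = 1e-8
# Naive softmax at a tiny T overflows: exp(2e8) is inf in float64.
print(np.exp(z / T))          # [inf inf inf] plus an overflow warning
# The numerically stable form (subtracting the max logit first) just collapses
# to a one-hot vector on the argmax, which is what greedy decoding gives anyway.
s = (z - z.max()) / T
p = np.exp(s)
print(p / p.sum())            # ~[1., 0., 0.]
```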
There are probably other, more subtle things that might make the LLM non-deterministic from run to run, though. It could be due to some non-determinism in the GPU/CPU hardware. Floating-point arithmetic is not associative, so results are very sensitive to the order of operations.
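A tiny example of that ordering sensitivity (not tied to any specific GPU kernel, just showing why a parallel reduction that sums in a different order can give different bits):

```python
# Floating-point addition is not associative: the same values summed in a
# different order produce different results.
vals = [1.0, 1e16, -1e16]
print(sum(vals))            # 0.0  (the 1.0 is absorbed when added to 1e16)
print(sum(reversed(vals)))  # 1.0  (the big terms cancel first, so the 1.0 survives)
```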
TL;DR: for as much determinism as possible, just choose the token with the highest probability (i.e., don't sample the distribution).