Comment by iandanforth
17 days ago
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
This isn't really true, unfortunately -- mixture-of-experts routing seems to suffer from batch-level non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports and discussion with a bit of searching. The observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.
If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?
Temperature changes the distribution that is sampled from, not whether a distribution is sampled from at all.
Temperature changes the softmax equation [1], not whether you sample from the softmax result or simply take the highest-probability token. IBM's documentation corroborates this, saying you need to set do_sample to True for the temperature to have any effect, i.e., T changes how we sample, not whether we sample [2].
A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3]. (A small sketch below the links makes the temperature-vs-sampling distinction concrete.)
[1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html
[2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...
[3] https://community.openai.com/t/clarifications-on-setting-tem...
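For anyone who wants to see the distinction concretely, here is a minimal numpy sketch (the function and variable names are mine, not any vendor's API) of what temperature does to the softmax and where the greedy-vs-sample choice actually lives:

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Temperature rescales the logits before softmax: it reshapes the
        # distribution, but it does not decide whether we sample from it.
        z = logits / max(temperature, 1e-8)  # guard the T -> 0 limit
        z = z - z.max()                      # numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = np.array([2.0, 1.0, 0.5])
    rng = np.random.default_rng(0)

    for t in (1.0, 0.5, 0.01):
        probs = softmax_with_temperature(logits, t)
        greedy = int(np.argmax(probs))                   # do_sample=False: same token every run
        sampled = int(rng.choice(len(probs), p=probs))   # do_sample=True: stochastic
        print(t, probs.round(3), greedy, sampled)

As T goes to zero the distribution collapses onto the argmax, so sampling and greedy coincide in the limit; that is why people expect T=0 to be deterministic even when sampling is technically still enabled.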
Here routing is probably the dominant factor, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs optimized for fixed-point math, floating-point arithmetic is still non-associative, therefore results are non-deterministic with respect to the reduction order introduced by parallelising the calculations.
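You can see the non-associativity without any GPU at all; a quick sketch in plain Python:

    import random

    # Floating-point addition is not associative: the same numbers summed in a
    # different order need not give bit-identical results.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False with IEEE-754 doubles

    random.seed(0)
    xs = [random.gauss(0, 1) for _ in range(10_000)]
    sums = set()
    for _ in range(5):
        random.shuffle(xs)
        sums.add(sum(xs))
    # A parallel reduction on a GPU effectively picks one of many possible
    # orders each run, which is where run-to-run wobble can creep in.
    print(len(sums), max(sums) - min(sums))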
I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is that the cores need thermal management and the floating-point ops may be reordered during that process. (By temperature I mean physical temperature, not the NN sampling parameter.) At that scale of computation, these small differences can show up as actually different tokens.
The parent is suggesting that temperature only applies at the generation step, but the choice of backend "expert model" that a request is routed to (and that then performs the generation) is non-deterministic. Rather than a single set of weights, there are several different sets of weights that constitute the "experts" in MoE. I have no idea if that's true, but that's the assertion.
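Nobody outside the providers knows what their serving stacks actually do, but here is a toy sketch of one hypothesized mechanism people point at: top-1 routing with a per-expert capacity limit, where whether your token gets its preferred expert depends on what else happens to share the batch. Every name and number below is made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    N_EXPERTS, DIM = 4, 8
    router_w = rng.normal(size=(DIM, N_EXPERTS))  # toy router weights

    def route(tokens, capacity):
        # Top-1 routing with a per-expert capacity; overflow falls back to expert 0
        # (a deliberately crude stand-in for dropping/rerouting in real systems).
        scores = tokens @ router_w
        preferred = scores.argmax(axis=1)
        load = np.zeros(N_EXPERTS, dtype=int)
        assigned = []
        for e in preferred:
            if load[e] < capacity:
                assigned.append(int(e))
                load[e] += 1
            else:
                assigned.append(0)
        return np.array(assigned)

    my_tokens = rng.normal(size=(4, DIM))
    other_requests = rng.normal(size=(16, DIM))

    # Same tokens, same weights, no sampling anywhere -- yet which expert my
    # tokens land on can depend on the rest of the batch.
    alone = route(my_tokens, capacity=2)
    co_batched = route(np.vstack([other_requests, my_tokens]), capacity=2)[-4:]
    print(alone, co_batched)  # may differ purely due to batch composition

Whether anything like this is what actually happens in production is exactly the open question upthread.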
I have seen numbers come out differently in JAX depending just on the batch size, simply because the compiler optimizes to a different sequence of operations on the hardware.
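A quick way to poke at this yourself (whether you actually see a difference depends on your backend, XLA version, and the op in question) is to compare the same row computed alone versus as part of a bigger batch:

    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    k1, k2 = jax.random.split(key)
    w = jax.random.normal(k1, (512, 512))
    x = jax.random.normal(k2, (64, 512))

    f = jax.jit(lambda a: a @ w)

    alone = f(x[:1])       # the first row, as a batch of 1
    in_batch = f(x)[:1]    # the same row, computed inside a batch of 64
    # On some backends these match bit for bit; on others the compiler picks a
    # different tiling/reduction order per batch shape and the low bits differ.
    print(jnp.max(jnp.abs(alone - in_batch)))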
Quantized floating-point math can, under certain scenarios, be non-associative.
Combine that fact with your request being part of a diverse batch of requests over an MoE model, and the outputs are non-deterministic.
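A contrived sketch of why those low bits matter for generation at all: if two logits are nearly tied, a last-bit difference from a different reduction order is enough to flip the greedy token. (The two "orders" below just stand in for two different batchings/kernels; all values are synthetic.)

    import numpy as np

    parts = np.random.default_rng(1).normal(size=4096).astype(np.float32)

    # The same partial sums for token A's logit, reduced in two different orders.
    logit_a_seq = np.float32(0.0)
    for c in parts:
        logit_a_seq += c                                        # strictly sequential
    logit_a_blocked = parts.reshape(64, 64).sum(axis=1).sum()   # blocked reduction

    logit_b = np.float32(logit_a_seq)  # pretend token B is exactly tied under the first order

    print(logit_a_seq, logit_a_blocked)            # usually differ in the low bits
    print(np.argmax([logit_a_seq, logit_b]),       # greedy choice under order 1
          np.argmax([logit_a_blocked, logit_b]))   # can flip under order 2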