
Comment by iandanforth

17 days ago

At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.

This isn't really true, unfortunately -- mixture-of-experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why, but you can easily replicate the behavior yourself or find bug reports and discussion with a bit of searching. The observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.

  • If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?

    • Temperature changes the distribution that is sampled, not if a distribution is sampled.

      Temperature changes the softmax equation [1], not whether you sample from the softmax result or simply choose the highest-probability token. IBM's documentation corroborates this, saying you need to set do_sample to True for the temperature to have any effect, i.e., T changes how we sample, not whether we sample [2]. (A minimal sketch of this distinction appears after this thread.)

      A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].

      [1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html

      [2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...

      [3] https://community.openai.com/t/clarifications-on-setting-tem...

      2 replies →

    • Here routing is probably the dominant factor, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs optimized for fixed-precision math, floating-point arithmetic is still non-associative, so results are non-deterministic with respect to the ordering introduced by parallelising the calculations (see the summation sketch near the end of this page).

      9 replies →

    • I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is that they need to temperature-control the cores, and the floating-point operations may be reordered during that process. (By temperature I mean physical temperature, not the sampling parameter.) At such a large scale of computation, these small differences can show up as genuinely different tokens.

      1 reply →

    • The parent is suggesting that temperature only applies at the generation step, but that the choice of backend “expert model” a request is given to (and which then performs the generation) is non-deterministic. Rather than being a single set of weights, there are a few different sets of weights that constitute the “experts” in MoE. I have no idea whether that’s true, but that’s the assertion.

      4 replies →

    • I have seen numbers come out differently in JAX depending only on the batch size, simply because the compiler optimizes to a different sequence of operations on the hardware.
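
Stepping back to the temperature point above, here is a minimal sketch (plain Python/NumPy, purely illustrative -- the function names are made up and this is not any vendor's implementation) of the distinction: temperature rescales the logits before the softmax, while a separate switch, analogous to do_sample in the docs cited above, decides whether you sample from the result or just take the argmax.

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Temperature rescales the logits before the softmax; it changes the
        # shape of the distribution, not whether we sample from it.
        z = np.asarray(logits, dtype=np.float64) / temperature
        z -= z.max()  # subtract the max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def next_token(logits, temperature=1.0, do_sample=False, rng=None):
        # Whether we sample at all is a separate switch from temperature.
        # Serving stacks typically special-case temperature == 0 as greedy decoding.
        if not do_sample or temperature == 0.0:
            return int(np.argmax(logits))
        probs = softmax_with_temperature(logits, temperature)
        rng = rng if rng is not None else np.random.default_rng()
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.9, 0.5]
    print(next_token(logits, temperature=0.7, do_sample=False))  # greedy: always index 0
    print(next_token(logits, temperature=0.7, do_sample=True))   # sampled: varies run to run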

Quantized floating point math can, under certain scenarios, be non-associative.
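
As a minimal illustration of that non-associativity (plain NumPy, not any particular GPU kernel): the same float32 numbers, grouped the way a parallel or batched reduction might group them, give slightly different sums.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    total_a = x.sum()                                   # one reduction over the whole array
    total_b = x[::2].sum() + x[1::2].sum()              # same numbers, split then combined
    total_c = x.reshape(1000, 1000).sum(axis=0).sum()   # chunked, "parallel-style" grouping

    print(total_a, total_b, total_c)
    print(total_a == total_b, total_a == total_c)  # typically False: the low bits differ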

When you combine that fact with the fact that your request is served as part of a diverse batch to an MoE model, outputs end up non-deterministic.
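
To see how those two ingredients interact, here is a toy, purely hypothetical top-1 dot-product router (illustrative only, not any production MoE): when two experts score within the rounding noise of a batch- or kernel-dependent accumulation order, the same token can be routed to different experts on different runs, and a different expert means a visibly different continuation rather than just a flipped low-order bit.

    import numpy as np

    def route(token_vec, expert_mat, order):
        # Toy top-1 router: dot-product scores accumulated in float32 in a given
        # order; `order` stands in for the batch/kernel-dependent reduction order.
        scores = []
        for w in expert_mat:
            s = np.float32(0.0)
            for i in order:
                s += np.float32(token_vec[i]) * np.float32(w[i])
            scores.append(s)
        return int(np.argmax(scores)), scores

    rng = np.random.default_rng(1)
    d = 512
    token = rng.standard_normal(d).astype(np.float32)
    experts = rng.standard_normal((2, d)).astype(np.float32)
    # Engineer a near-tie between the two experts so rounding error can decide the winner.
    experts[1] += ((token @ (experts[0] - experts[1])) / (token @ token)) * token

    print(route(token, experts, order=range(d)))              # forward accumulation
    print(route(token, experts, order=range(d - 1, -1, -1)))  # reversed accumulation
    # With a near-tie, the two accumulation orders can disagree on which expert wins.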