
Comment by iandanforth

17 days ago

At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.

This isn't really true, unfortunately -- mixture-of-experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why, but you can easily replicate the behavior yourself or find bug reports and discussion with a bit of searching. The observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.

  • If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?

    • Temperature changes the distribution that is sampled, not if a distribution is sampled.

      Temperature changes the softmax equation [1], not whether you sample from the softmax result or simply choose the highest-probability token. IBM's documentation corroborates this, saying you need to set do_sample to True for the temperature to have any effect, i.e., T changes how we sample, not whether we sample [2]. (A minimal sketch of this distinction appears after this thread.)

      A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].

      [1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html

      [2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...

      [3] https://community.openai.com/t/clarifications-on-setting-tem...

      2 replies →

    • Here routing is probably the dominant factor, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs optimized for fixed-precision math, floating-point arithmetic is still non-associative, so results are non-deterministic with respect to the ordering introduced by parallelising the calculations (see the summation sketch near the end of this page).

      9 replies →

    • I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is that they need to temperature-control the cores, and the floating-point operations may be reordered during that process. (By temperature I mean physical temperature, not the sampling parameter.) At such a large scale of computation, these small differences can show up as genuinely different tokens.

      1 reply →

    • The parent is suggesting that temperature only applies at the generation step, but that the choice of backend “expert model” a request is given to (and which then performs the generation) is non-deterministic. Rather than being a single set of weights, there are a few different sets of weights that constitute the “experts” in MoE. I have no idea whether that’s true, but that’s the assertion.

      4 replies →

    • I have seen numbers come out differently in JAX depending only on the batch size, simply because the compiler optimizes to a different sequence of operations on the hardware.
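
Stepping back to the temperature point above, here is a minimal sketch (plain Python/NumPy, purely illustrative -- the function names are made up and this is not any vendor's implementation) of the distinction: temperature rescales the logits before the softmax, while a separate switch, analogous to do_sample in the docs cited above, decides whether you sample from the result or just take the argmax.

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Temperature rescales the logits before the softmax; it changes the
        # shape of the distribution, not whether we sample from it.
        z = np.asarray(logits, dtype=np.float64) / temperature
        z -= z.max()  # subtract the max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def next_token(logits, temperature=1.0, do_sample=False, rng=None):
        # Whether we sample at all is a separate switch from temperature.
        # Serving stacks typically special-case temperature == 0 as greedy decoding.
        if not do_sample or temperature == 0.0:
            return int(np.argmax(logits))
        probs = softmax_with_temperature(logits, temperature)
        rng = rng if rng is not None else np.random.default_rng()
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.9, 0.5]
    print(next_token(logits, temperature=0.7, do_sample=False))  # greedy: always index 0
    print(next_token(logits, temperature=0.7, do_sample=True))   # sampled: varies run to run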

Quantized floating point math can, under certain scenarios, be non-associative.
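
As a minimal illustration of that non-associativity (plain NumPy, not any particular GPU kernel): the same float32 numbers, grouped the way a parallel or batched reduction might group them, give slightly different sums.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    total_a = x.sum()                                   # one reduction over the whole array
    total_b = x[::2].sum() + x[1::2].sum()              # same numbers, split then combined
    total_c = x.reshape(1000, 1000).sum(axis=0).sum()   # chunked, "parallel-style" grouping

    print(total_a, total_b, total_c)
    print(total_a == total_b, total_a == total_c)  # typically False: the low bits differ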

When you combine that fact with the fact that your request is served as part of a diverse batch to an MoE model, outputs end up non-deterministic.
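
To see how those two ingredients interact, here is a toy, purely hypothetical top-1 dot-product router (illustrative only, not any production MoE): when two experts score within the rounding noise of a batch- or kernel-dependent accumulation order, the same token can be routed to different experts on different runs, and a different expert means a visibly different continuation rather than just a flipped low-order bit.

    import numpy as np

    def route(token_vec, expert_mat, order):
        # Toy top-1 router: dot-product scores accumulated in float32 in a given
        # order; `order` stands in for the batch/kernel-dependent reduction order.
        scores = []
        for w in expert_mat:
            s = np.float32(0.0)
            for i in order:
                s += np.float32(token_vec[i]) * np.float32(w[i])
            scores.append(s)
        return int(np.argmax(scores)), scores

    rng = np.random.default_rng(1)
    d = 512
    token = rng.standard_normal(d).astype(np.float32)
    experts = rng.standard_normal((2, d)).astype(np.float32)
    # Engineer a near-tie between the two experts so rounding error can decide the winner.
    experts[1] += ((token @ (experts[0] - experts[1])) / (token @ token)) * token

    print(route(token, experts, order=range(d)))              # forward accumulation
    print(route(token, experts, order=range(d - 1, -1, -1)))  # reversed accumulation
    # With a near-tie, the two accumulation orders can disagree on which expert wins.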