Comment by aesthesia

12 hours ago

A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.

16 comments

317070 12 hours ago

> so in principle, setting temperature to 0 _should_ result in deterministic outputs

This is a common misconception; it is not true even in principle. If two or more logits tie for the maximum, I will sample uniformly at random among them at any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first and the last element.
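A minimal sketch of the tie case in plain Python. The helper `softmax_with_temperature` is a hypothetical name used for illustration, not a library function; it only shows where the probability mass goes as the temperature shrinks toward 0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, stabilized by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 0.0, 1.0]
for t in (1.0, 0.1, 0.01):
    # The two tied entries keep equal probability at every temperature.
    print(t, softmax_with_temperature(logits, t))
```

As the temperature drops, the two tied entries each approach probability 0.5 and the middle entry approaches 0, so the limit is a coin flip rather than a single outcome.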

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
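To illustrate why summation order matters, here is a small sketch using ordinary Python floats. It runs on the CPU and is not GPU code; the variable names are illustrative only.

```python
# Floating-point addition is not associative: the grouping decides where
# rounding happens, so the same terms can sum to different results.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)  # False

# A more extreme grouping effect: the 1.0 is rounded away when it is added
# to a huge value first, but survives if the huge values cancel first.
absorbed = (1e17 + 1.0) - 1e17
preserved = (1e17 - 1e17) + 1.0
print(absorbed, preserved)  # 0.0 1.0
```

A parallel reduction may group partial sums differently from one run to the next, which is the same effect at a larger scale. Whether that actually happens depends on the kernel being used.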

jstanley 10 hours ago

> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.
- vbarrielle 10 hours ago
  
  It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.
  
EvgeniyZh 12 hours ago

You don't have to sample uniformly. You could take the lowest index among all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing about that.
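A sketch of that tie-breaking rule in plain Python; `pick_lowest_index_max` is a hypothetical helper name, not something from an existing library.

```python
def pick_lowest_index_max(logits):
    """Return the lowest index among all entries equal to the maximum."""
    best = max(logits)
    # list.index returns the first (lowest) position where the value occurs.
    return logits.index(best)

print(pick_lowest_index_max([1.0, 0.0, 1.0]))  # 0
```

With this rule, a tie no longer introduces randomness: the same logits always give the same index. It does nothing about logits that differ between runs.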
DougBTX 10 hours ago

> GPUs put the associativity of the sums in matrix multiplications in arbitrary order

That’s user-controlled too, not an inherent property of GPUs:

https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
- vbarrielle 9 hours ago
  
  The matrix multiplication is only deterministic for sparse-dense products under these settings:
  
  > torch.bmm() when called on sparse-dense CUDA tensors
  
  And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.
  

easygenes 12 hours ago

There are. If the kernels are nondeterministic (e.g. because of timing issues), outputs change slightly between runs on a single system, even with greedy decoding enabled (which is typically what temperature=0 achieves).

IshKebab 12 hours ago

Setting the temperature to 0 should give deterministic results, but that's not any better: it's just hiding the huge variance by only taking one sample.

croes 11 hours ago

So you would always get the same result, but it could be the wrong one.

srdjanr 11 hours ago

Of course, nothing can guarantee the right answer from LLMs

valzam 12 hours ago

I mean, the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Higher temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 for T1, 0.19 for T2, ... and sometimes sample T2.

aesthesia 12 hours ago

No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special-case temperature 0 and pick the most likely token at each step.
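A minimal sketch of such a special case, assuming a sampler that receives raw logits. The function `sample_token` and its parameters are hypothetical and not taken from any particular library; real samplers may order their steps differently.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Pick a token index from logits; temperature 0 means greedy decoding."""
    if temperature == 0:
        # Special case: avoid dividing by zero and take the argmax directly.
        # max() returns the first maximal index, so ties resolve to the lowest.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them itself.
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

print(sample_token([0.1, 2.0, 1.5], temperature=0))  # 1
```

At temperature 0 this path never consults the random number generator, so identical logits always give the same token. At any positive temperature the result depends on the generator's state.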
- dvt 12 hours ago
  
  This is a very authoritative answer that should be more nuanced and caveated as implementation-dependent. In some cases, repetition penalties take precedence over sampling; top_k and top_p can also be handled before or after the temperature step. In other cases, `0` is turned into something like 1e-10 or another tiny float value (which can drift if you do any arithmetic with it). Routing, quantization, etc. can also have an effect on sampling. And yes, in some cases, setting temperature to 0 can mean "pure greedy decoding", which makes the decoder about as deterministic as it can get.