Temperature changes the distribution that is sampled, not if a distribution is sampled.
Temperature changes the softmax equation [1], not whether you sample from the softmax result or simply choose the highest-probability token. IBM's documentation corroborates this, saying you need to set do_sample to True for the temperature to have any effect, i.e., T changes how we sample, not whether we sample [2].
A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].
[1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html
[2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...
[3] https://community.openai.com/t/clarifications-on-setting-tem...
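The point is easy to verify numerically. A minimal sketch (the logits are made up for illustration): temperature reshapes the distribution, but the argmax never moves, so greedy decoding is unaffected by it.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Temperature divides the logits before the softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    # Probabilities change with T, but the top token does not.
    assert int(np.argmax(p)) == 0
```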
I have dealt with traditional ML models in the past, including things like TensorFlow non-reproducibility, and managed to make them behave reproducibly. This is a very basic requirement. If we cannot even have that, or if people who work with Gemini or similar models do not even know why they don't deliver reproducible results ... that seems very bad. It becomes outright unusable for anyone wanting to do research with reliable results. We already have a reproducibility crisis, because researchers often lack the knowledge to properly handle their tooling and would need a knowledgeable engineer to set it up. Only most engineers don't know either, and don't pay enough attention to detail to make reproducible software.
Your response is correct. However, you can choose to not sample from the distribution. You can have a rule to always choose the token with the highest probability generated by the softmax layer.
This approach should make the LLM deterministic regardless of the temperature chosen.
P.S. Choosing lower and lower temperatures will make the LLM more deterministic, but it will never be totally deterministic, because there will always be some probability mass on other tokens. It's also not possible to use a temperature of exactly 0, because exp(z/T) blows up as T approaches 0. Like I mentioned above, you can avoid fiddling with temperature and just always choose the highest-probability token for full determinism.
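A toy sketch of both points (logits made up for illustration): exp(z/T) overflows long before T reaches 0, while greedy decoding sidesteps temperature entirely.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])

# softmax(z/T): as T -> 0, exp(z/T) overflows well before T hits 0.
with np.errstate(over="ignore"):
    blown_up = np.exp(logits / 1e-4)   # overflows to [inf inf inf]

# Greedy decoding avoids the problem: always take the argmax.
greedy_token = int(np.argmax(logits))
```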
There are probably other, more subtle things that can make the LLM non-deterministic from run to run, though. It could be due to some non-determinism in the GPU/CPU hardware; floating point is very sensitive to ordering.
TL;DR: for as much determinism as possible, just choose the token with the highest probability (i.e., don't sample the distribution).
Here routing would probably dominate, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs optimized for fixed-point math, floating-point addition is still non-associative, so results are non-deterministic with respect to the randomness introduced by parallelizing the calculations.
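The non-associativity is easy to demonstrate with the standard float64 example: regrouping the same three addends changes the rounded result.

```python
# Floating-point addition is commutative but NOT associative:
# the same three values, grouped differently, round differently.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
assert left != right
```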
Of course, which part of the calculation happens where should also be specifiable and able to be made deterministic, or should not affect the result. A map-reduce process's reduce step, merging results from various places, should likewise be able to give reproducible results, regardless of which results arrive first or from where.
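One way to get that, sketched below (the chunking and worker count are illustrative): tag each partial result with its chunk index, then merge in chunk order rather than arrival order, so completion timing cannot affect the result.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

def deterministic_parallel_sum(chunks, workers=4):
    # Workers may finish in any order (as_completed yields by completion
    # time), so every partial sum carries its chunk index ...
    with ThreadPoolExecutor(workers) as pool:
        futures = [pool.submit(lambda i=i, c=c: (i, math.fsum(c)))
                   for i, c in enumerate(chunks)]
        partials = [f.result() for f in as_completed(futures)]
    # ... and the merge happens in a fixed order, for reproducibility.
    partials.sort(key=lambda t: t[0])
    return math.fsum(s for _, s in partials)
```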
Is our tooling too bad for this?
> Is our tooling too bad for this?
Floating points are fundamentally too bad for this. We use them because they're fast, which usually more than compensates for inaccuracies FP math introduces.
(One, dealing with FP errors is mostly a fixed cost - there's a branch of CS/mathematics specializing in it, producing formally proven recipes for computing specific things in ways that minimize, or at least give specific bounds on, the errors. That's work that can be done once and reused forever. Two, most programmers are oblivious to those issues anyway, and we've learned to live with the bugs :).)
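Kahan's compensated summation is the textbook example of such a recipe - a minimal sketch:

```python
def kahan_sum(values):
    # Compensated summation (Kahan): track the low-order bits lost at
    # each addition and feed them back into the next one, giving an
    # error bound that doesn't grow with the number of terms.
    total = 0.0
    comp = 0.0                    # running compensation for lost bits
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y    # (t - total) is what actually landed
        total = t
    return total
```

Compare kahan_sum([0.1] * 10) with the naive sum([0.1] * 10), which drifts.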
When your parallel map-reduce is just doing matrix additions and multiplications, guaranteeing order of execution comes with serious overhead. For one, you need to have all partial results available together before reducing, so either the reduction step needs to have enough memory to store a copy of all the inputs, or it needs to block the units computing those inputs until all of them finish. Meanwhile, if you drop the order guarantee, then the reduction step just needs one fixed-size accumulator, and every parallel unit computing the inputs is free to go and do something else as soon as it's done.
So the price you pay for deterministic order is either a reduction of throughput or increase in on-chip memory, both of which end up translating to slower and more expensive hardware. The incentives strongly point towards not giving such guarantees if it can be avoided - keep in mind that GPUs have been designed for videogames (and graphics in general), and for this, floating point inaccuracies only matter when they become noticeable to the user.
Why would the same software on the same GPU architecture use different commutations from run to run?
Also if you're even considering fixed point math, you can use integer accumulators to add up your parallel chunks.
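For example (the scale factor is an arbitrary illustrative choice): scale the values to integers, sum them exactly - integer addition is associative, so any merge order gives the same result - and convert back once at the end.

```python
# Fixed-point accumulation: integer addition is exact and associative,
# so parallel chunks can be merged in any order with identical results.
SCALE = 10**6
vals = [0.1, 0.2, 0.3]
acc = sum(round(v * SCALE) for v in vals)   # exact integer arithmetic
result = acc / SCALE                        # single final conversion
```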
Why would the same multithreaded software run on the same CPU (not just architecture - the same physical chip) have its instructions execute in different order from run to run? Performance. Want things deterministic? You have to explicitly keep them in sync yourself. GPUs sport tens of thousands of parallel processors these days, which are themselves complex, and are linked together with more complexity, both hardware and software. They're designed to calculate fast, not to ensure every subprocessor is always in lock step with every other one.
Model inference on a GPU is mostly doing the GPU equivalent of a parallelized product over (X1, X2, X3, ... Xn), where each X is itself some matrix computed by a parallelized product of other matrices. Unless there's some explicit guarantee somewhere that the reduction step will pause until it gets all results so it can guarantee order, instead of reducing eagerly, each such step is a non-determinism transducer, turning undetermined execution order into floating-point differences via reordering.
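The rounding effect of that reordering can be shown with three float32 addends (toy values, but the mechanism is exactly the one described):

```python
import numpy as np

big, tiny = np.float32(1e8), np.float32(1.0)

# Same three addends, two reduction orders:
a = (big + (-big)) + tiny   # big cancels first              -> 1.0
b = (big + tiny) + (-big)   # tiny absorbed by big's rounding -> 0.0

# An eager reduction that doesn't fix the order can produce either
# result, depending on which partial arrives first.
```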
I'm not a GPU engineer so I don't know for sure, especially about the new cards designed for AI, but since reducing eagerly allows more memory-efficient design and improves throughput, and GPUs until recently were optimized for games (where FP accuracy doesn't matter that much), and I don't recall any vendor making determinism a marketing point recently, I don't believe GPUs suddenly started to guarantee determinism at expense of performance.
I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is that they need to temperature-control the cores, and the FP ops may be reordered during that process. (By temperature I mean physical temperature, not some NN sampling parameter.) At such a large scale of computation, these small differences can show up as actually different tokens.
I can assure you this isn't true. Having worked with GPUs for many years in an application where consistent results are important, I can say it's not only possible but actually quite easy to ensure consistent inputs produce consistent results. Temperature and clock speed do not affect the order of operations, only the speed, and this doesn't affect the results. This is the same as with any modern CPU, which will also adjust its clock for temperature.
The parent is suggesting that temperature only applies at the generation step, but that the choice of backend “expert model” a request is routed to (which then performs the generation) is non-deterministic. Rather than a single set of weights, there are several different sets of weights that constitute the “experts” in MoE. I have no idea if that's true, but that's the assertion.
I don't think it makes sense? Somewhere there has to be an RNG for that to be true. MoE itself doesn't introduce randomness, and the routing to experts is part of the model weights, not (I think) a separate model.
The samples your input is batched with on the provider's backend vary between calls, and sparse mixture-of-experts routing, when implemented for efficient utilization, induces competition among tokens, with either encouraged or enforced balancing of expert usage across tokens in the same fixed-size group [1][2]. I think it's unknown, or at least undisclosed, exactly why sequence non-determinism at zero temperature occurs in these proprietary implementations, but I think this is a good theory.
[1] https://arxiv.org/abs/2308.00951 (pg. 4)
[2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
You don't need an RNG, since the whole transformer is an extremely large floating-point arithmetic unit. A wild guess: what if the source of non-determinism is that, at the hardware level, tensor execution order is not guaranteed, and therefore (T0 * T1) * T2 can produce slightly different results than T0 * (T1 * T2) due to rounding errors?
I have seen numbers come out differently in JAX depending just on the batch size, simply because the compiler optimizes to a different sequence of operations on the hardware.