Comment by altcognito
18 hours ago
Explain this though. The code is deterministic, even if it relies on pseudo-random number generation. It doesn't just happen; someone has to make a conscious decision to force a different code path (or model) when the system is loaded.
It's not deterministic. Any individual floating-point mul/add is deterministic, but on a GPU these are all happening in parallel, and the accumulation happens in whatever order they complete.
When you add A then B then C, you can get a different answer than C then A then B, because floating-point addition isn't associative: rounding error, subnormals, etc.
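A minimal sketch of that in plain Python (float64, nothing GPU-specific; the same idea applies to accumulation order on a GPU):

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0: the 1.0 is absorbed into -1e16 before a can cancel it

    import random
    xs = [random.uniform(-1, 1) for _ in range(100_000)]
    s1 = sum(xs)
    random.shuffle(xs)
    s2 = sum(xs)
    print(s1 == s2)  # usually False: same numbers, different order, different bits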
It can be made deterministic. It's not trivial, and it can slow things down a bit (not much), but there are environment variables you can set to make your GPU computations bitwise reproducible. I have done this when training models with PyTorch.
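For reference, the knobs involved look roughly like this in PyTorch (a sketch of the usual settings; CUBLAS_WORKSPACE_CONFIG is the environment variable the PyTorch docs call out for deterministic cuBLAS on CUDA 10.2+):

    import os
    # must be set before any CUDA work happens
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    import torch

    torch.manual_seed(0)                      # seed the PRNGs
    torch.use_deterministic_algorithms(True)  # raise on known-nondeterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False    # benchmark mode can pick different kernels run to run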
There are settings to make it reproducible but they incur a non-negligible drop in performance.
Unsurprising given they amount to explicit synchronization to make the order of operations deterministic.
Not deterministic. https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
For all practical purposes, any code that relies on the output of a PRNG is non-deterministic in all but the most pedantic senses... And if the temperature isn't set to 0, LLMs are sampling from a distribution.
If you're going to call a PRNG deterministic then the outcome of a complicated concurrent system with no guaranteed ordering is going to be deterministic too!
No, this isn't right. There are totally legitimate use cases for PRNGs as sources of random sequences that follow a given probability distribution, where freezing the seed to get exact reproducibility is actually required.
And for a complicated concurrent system you can also replay the exact timings and orderings as well!
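That's the standard seeded-PRNG pattern, e.g. in Python:

    import random

    random.seed(42)
    run1 = [random.random() for _ in range(5)]

    random.seed(42)  # freeze the seed, replay the exact sequence
    run2 = [random.random() for _ in range(5)]

    print(run1 == run2)  # True: bit-identical runs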
1 reply →
How is this related to overloading? The nondeterminism should not be a function of load. An overloaded system should just time out or reply more slowly. It will only be dumber if it gets rerouted to a dumber, faster model, e.g. a quantized one.
Temperature can't be literally zero: the logits are divided by the temperature before the softmax, so T = 0 would be a divide-by-zero.
When people say zero, it is shorthand for “as deterministic as this system allows”, but it's still not completely deterministic.
Zero temperature just uses argmax, which is what the temperature-scaled softmax approaches in the limit as T goes to zero anyway. So it could very well be deterministic.
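A quick numpy sketch of that limit, with toy logits (nothing model-specific):

    import numpy as np

    logits = np.array([2.0, 1.0, 0.5])

    def softmax_t(z, t):
        z = z / t
        z = z - z.max()  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    for t in (1.0, 0.1, 0.01):
        print(t, softmax_t(logits, t).round(4))
    # as t -> 0 the distribution collapses onto the argmax,
    # so "temperature 0" is implemented as plain argmax rather than dividing by zero
    print(np.argmax(logits))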
Floating-point math isn't associative, even for operations that are associative in exact arithmetic.
That would just add up to statistical noise instead of 10% degradation over a week.
Catastrophic error accumulation can produce more profound effects than noise.
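One toy illustration of how float error can be systematic rather than zero-mean noise: updates below the rounding granularity of a large accumulator vanish entirely.

    acc = 1e16
    for _ in range(1_000_000):
        acc += 1.0        # each +1.0 is below the spacing between float64 values at 1e16
    print(acc - 1e16)     # 0.0: a million increments lost outright, not averaged away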
1 reply →
It takes a different code path for efficiency.
e.g.

    if batch_size > 1024:
        kernel_x()
    else:
        kernel_y()
There are a million techniques to make LLM inference more efficient as a tradeoff against output quality, like using a smaller model, using quantized models, using speculative decoding with a more permissive rejection threshold, etc.
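As a flavor of the quantization tradeoff, a toy numpy sketch of symmetric int8 weight quantization (illustrative only; real serving stacks are more sophisticated):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)  # stand-in for a weight tensor

    scale = np.abs(w).max() / 127                     # map the observed range onto int8
    w_q = np.round(w / scale).astype(np.int8)         # 4x smaller, cheaper matmuls
    w_dq = w_q.astype(np.float32) * scale             # what inference effectively uses

    print("max abs error:", np.abs(w - w_dq).max())   # small per weight,
    # but summed across billions of weights and many layers it shifts outputs:
    # exactly the quality-for-throughput tradeoff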