Comment by simonw
1 day ago
LLMs work using huge amounts of matrix multiplication.
Floating point multiplication is non-associative:
    a = 0.1, b = 0.2, c = 0.3
    a * (b * c) = 0.006
    (a * b) * c = 0.006000000000000001
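You can check this yourself in a plain Python REPL, no libraries needed:

    a, b, c = 0.1, 0.2, 0.3
    print(a * (b * c))                 # 0.006
    print((a * b) * c)                 # 0.006000000000000001
    print(a * (b * c) == (a * b) * c)  # False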
Almost all serious LLMs are deployed across multiple GPUs and have operations executed in batches for efficiency.
As such, the order in which those multiplications run depends on all sorts of factors. There are no guarantees of operation order, so non-associative floating point arithmetic can change the final result.
This means that, in practice, most deployed LLMs are non-deterministic even with a fixed seed.
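Here's a tiny, self-contained illustration of the order-dependence (my own construction, nothing vendor-specific): summing the same floating point values in two different orders.

    import random

    random.seed(0)
    vals = [random.gauss(0, 1) for _ in range(100_000)]

    total_forward = sum(vals)
    total_reversed = sum(reversed(vals))

    # Same values, different order of additions: the totals typically
    # disagree in the last few bits.
    print(total_forward == total_reversed)  # usually False
    print(total_forward - total_reversed)   # a tiny, usually nonzero, difference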
That's why vendors don't offer seed parameters with a promise of deterministic results - because that's a promise they cannot keep.
Here's an example: https://cookbook.openai.com/examples/reproducible_outputs_wi...
> Developers can now specify seed parameter in the Chat Completion request to receive (mostly) consistent outputs. [...] There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.
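For reference, here's roughly what using that seed parameter looks like with the OpenAI Python client (the model name below is just a placeholder):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Say something surprising."}],
        seed=42,         # best-effort reproducibility, per the cookbook
        temperature=0,
    )

    # Per the cookbook, outputs are only "mostly" consistent, even when
    # system_fingerprint matches across requests.
    print(response.system_fingerprint)
    print(response.choices[0].message.content)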
> That's why vendors don't offer seed parameters with a promise of deterministic results - because that's a promise they cannot keep.
They absolutely can keep such a promise, as anyone who has worked directly with LLMs can confirm. I can run a sequence of tokens through a large LLM thousands of times and get identical results every time (and have done precisely this! In fact, in one situation it was a QA test I built). I could run it millions of times and get exactly the same final layer every single time.
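A sketch of that kind of repeatability test - greedy decoding, single process, fixed hardware; the model choice (gpt2) and run count here are mine:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        runs = [
            model.generate(
                ids,
                do_sample=False,  # greedy decoding: no sampling randomness
                max_new_tokens=20,
                pad_token_id=tokenizer.eos_token_id,
            )
            for _ in range(10)
        ]

    # On a single machine with a fixed software stack, greedy decoding
    # produces bit-identical token sequences on every run.
    print(all(torch.equal(runs[0], r) for r in runs[1:]))  # True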
They don't want to keep such a promise because it limits the flexibility and optimizations available when operating at very large scale. This is not an LLM thing, and saying "LLMs are non-deterministic" is simply wrong, even if you can find an LLM purveyor that has chosen to stop caring about that outcome. And FWIW, non-associative floating point arithmetic is usually not the reason.
It's like claiming that a chef cannot do something McDonald's and Burger King don't do, holding those purveyors up as examples of what is possible when cooking. Nothing works like that.
If not non-associative floating point, what's the reason?
There are a huge number of reasons in large-scale systems. Batch sizes vary, which leads to routing variations in MoE systems (which basically all big LLMs are now). Consecutive submissions could be routed to entirely different hardware, software, and even quantization levels! Repeat resubmissions could even hit different variants of a model.
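A toy sketch of the batching/routing interaction (entirely my own construction, not any production router): a top-1 MoE router with per-batch expert capacity, where the expert a token lands on depends on what else happens to be in the batch with it.

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, dim = 4, 8
    router_w = rng.normal(size=(dim, n_experts))

    def route(batch, capacity_factor=1.0):
        # Per-batch expert capacity, as in common MoE designs.
        capacity = max(1, int(capacity_factor * len(batch) / n_experts))
        prefs = np.argsort(-(batch @ router_w), axis=1)  # expert rankings
        load = [0] * n_experts
        assigned = []
        for ranking in prefs:
            for e in ranking:           # if the preferred expert is full,
                if load[e] < capacity:  # overflow falls through to the
                    load[e] += 1        # token's next choice
                    assigned.append(e)
                    break
        return assigned

    token = rng.normal(size=dim)  # the same token, in two different batches
    small = np.vstack([rng.normal(size=(3, dim)), token])
    large = np.vstack([rng.normal(size=(31, dim)), token])

    # The same token can land on a different expert purely because of
    # the other requests batched alongside it.
    print(route(small)[-1], route(large)[-1])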
No one targets determinism because randomness/"creativity" in LLMs is considered a prime feature, so there is zero reason to avoid variation - but non-determinism isn't some core property of LLMs.