Comment by xnx
2 days ago
> It’s worth noting that LLMs are non-deterministic,
This is probably better phrased as "LLMs may not provide consistent answers due to changing data and built-in randomness."
Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.
I don't think those race conditions are rare. None of the big hosted LLMs provide a temperature=0 plus fixed seed feature which they guarantee won't return different results, despite clear demand for that from developers.
I naively (an uninformed guess) assumed the non-determinism (multiple results possible, even with temperature=0 and a fixed seed) stemmed from floating point rounding errors propagating through the calculations. How wrong am I?
You may be interested in https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm... .
> The non-determinism at temperature zero, we guess, is caused by floating point errors during forward propagation. Possibly the “not knowing what to do” leads to maximum uncertainty, so that logits for multiple completions are maximally close and hence these errors (which, despite a lack of documentation, GPT insiders inform us are a known, but rare, phenomenon) are more reliably produced.
Also uninformed, but I can't see how that would be true: floating point rounding errors are entirely deterministic.
They're gonna round the same each time you're running it on the same hardware.
With a fixed seed there will be the same floating point rounding errors.
A fixed seed is enough for determinism. You don't need to set temperature=0. Setting temperature=0 also means that you aren't sampling, which means you're doing greedy one-step probability maximization, which can make the text end up strange for that reason.
Fair. I dislike "non-deterministic" as a blanket descriptor for all LLMs, since it implies some type of magic or quantum effect.
I see LLM inference as sampling from a distribution. Multiple details go into that sampling - everything from parameters like temperature to numerical imprecision to batch mixing effects as well as the next-token-selection approach (always pick max, sample from the posterior distribution, etc). But ultimately, if it was truly important to get stable outputs, everything I listed above can be engineered (temp=0, very good numerical control, not batching, and always picking the max probability next token).
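A toy sketch of those knobs (numpy only; the function name and logits are made up for illustration):

    import numpy as np

    def next_token(logits, temperature=1.0, seed=None):
        """Pick the next token id from raw logits."""
        logits = np.asarray(logits, dtype=np.float64)
        if temperature == 0.0:
            return int(np.argmax(logits))           # always pick the max-probability token
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                        # softmax over temperature-scaled logits
        rng = np.random.default_rng(seed)           # fixed seed => reproducible sampling
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.5, 0.3]
    print(next_token(logits, temperature=0.0))          # deterministic, no seed needed
    print(next_token(logits, temperature=0.8, seed=1))  # also deterministic: the seed is fixed
    print(next_token(logits, temperature=0.8, seed=1))  # same value as the line above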
dekhn from a decade ago cared a lot about stable outputs. dekhn today thinks sampling from a distribution is a far more practical approach for nearly all use cases. I could see it mattering when the false negative rate of a medical diagnostic exceeded a reasonable threshold.
Errr... that word implies some type of non-deterministic effect. Like using a randomizer without specifying the seed (ie. sampling from a distribution). I mean, stuff like NFAs (non-deterministic finite automata) isn't magic.
Interesting, but in general it does not imply that. For example: https://en.wikipedia.org/wiki/Nondeterministic_finite_automa...
I agree it's phrased poorly.
Better said would be: LLMs are designed to act as if they were non-deterministic.
> despite clear demand for that from developers
Theorizing about why that is: could it be that they can't do deterministic inference and batching at the same time, so they avoid offering it because that would require them to stop batching, which would shoot up costs?
The many sources of stochastic/non-deterministic behavior have been mentioned in other replies but I wanted to point out this paper: https://arxiv.org/abs/2506.09501 which analyzes the issues around GPU non determinism (once sampling and batching related effects are removed).
One important take-away is that these issues are more likely in longer generations, so reasoning models can suffer more.
FP multiplication is non-associative.
It doesn’t mean it’s non-deterministic though.
But it does when coupled with non-deterministic request batching, which is the case.
That's like a cryptographic hash: you can't deduce the input t from the hash h, but the same input always gives you the same hash, so t->h is deterministic. h->t is, in practice, not a direction you can or want to walk (it's prohibitively expensive), and because there may be, indeed must be, collisions (a typical hash is much smaller than typical inputs), the inverse isn't h->t with a single input but h->{t1,t2,...}, a practically open set of possible inputs. It's still deterministic.
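A quick illustration with Python's hashlib (arbitrary input):

    import hashlib

    t = b"the same input"
    print(hashlib.sha256(t).hexdigest() == hashlib.sha256(t).hexdigest())  # True: t->h is deterministic
    # h->t is a different question: expensive to walk and not unique,
    # but that says nothing about whether the forward direction is deterministic.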
I think the better statement is likely "LLMs are typically not executed in a deterministic manner", since you're right that there are no non-deterministic properties inherent to the models themselves that I'm aware of.
I run my local LLMs with a seed of one. If I re-run my "ai" command (which starts a conversation with its parameters as a prompt) I get exactly the same output every single time.
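Something like this, if anyone wants to reproduce it against a local Ollama server (default endpoint; the model name and prompt are just examples, not my exact setup):

    import json, urllib.request

    payload = {
        "model": "llama3",                         # example model name
        "prompt": "Explain what a random seed does in one sentence.",
        "stream": False,
        "options": {"seed": 1, "temperature": 0},  # fixed seed, no sampling randomness
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])  # same text on every re-run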
In my (poor) understanding, this can depend on hardware details. What are you running your models on? I haven't paid close attention to this with LLMs, but I've tried very hard to get non-deterministic behavior out of my training runs for other kinds of transformer models and was never able to on my 2080, 4090, or an A100. PyTorch docs have a note saying that in general it's impossible: https://docs.pytorch.org/docs/stable/notes/randomness.html
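For reference, the knobs that note is about look roughly like this (a sketch; on CUDA, some ops will simply raise an error if no deterministic implementation exists):

    import torch

    torch.manual_seed(0)                      # fix RNG state (sampling, dropout, init)
    torch.use_deterministic_algorithms(True)  # error out instead of using nondeterministic kernels
    torch.backends.cudnn.benchmark = False    # don't auto-tune (and thereby switch) conv algorithms

    x = torch.randn(4, 8)
    w = torch.randn(8, 2)
    print(x @ w)  # identical values on every run with the same seed, software and hardware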
Inference on a generic LLM may not be subject to these non-determinisms even on a GPU though, idk
Ah. I've typically avoided CUDA except for a couple of really big jobs so I haven't noticed this.
Yes. This is what I was trying to say. Saying "It’s worth noting that LLMs are non-deterministic" is wrong and should be changed in the blog post.
> Saying "It’s worth noting that LLMs are non-deterministic" is wrong and should be changed in the blog post.
Every person in this thread understood that Simon meant "Grok, ChatGPT, and other common LLM interfaces run with a temperature>0 by default, and thus non-deterministically produce different outputs for the same query".
Sure, he wrote a shorter version of that, and because of that y'all can split hairs on the details ("yes it's correct for how most people interact with LLMs and for grok, but _technically_ it's not correct").
The point of English blog posts is not to be a long wall of logical propositions, it's to convey ideas and information. The current wording seems fine to me.
The point of what he was saying was to caution readers "you might not get this if you try to repro it", and that is 100% correct.
You’re correct at batch size 1 (local is batch size 1), but not in the production use case, where multiple requests get batched together (and that’s how all the providers do this).
With batching, matrix shapes and a request’s position within them aren’t deterministic, and this leads to non-deterministic results, regardless of sampling temperature/seed.
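You can see the underlying effect without a GPU: change the order or grouping of a float32 reduction and the result drifts (toy sketch, numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    a = x.sum()                                 # one reduction order
    b = x.reshape(100, 1000).sum(axis=1).sum()  # grouped differently, as batching might do
    c = x[::-1].sum()                           # same numbers, reversed order

    print(a, b, c)         # typically three slightly different float32 values
    print(a == b, a == c)  # usually False, False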
"Non-deterministic" in the sense that a dice roll is when you don't know every parameter with ultimate precision. On one hand I find insistence on the wrongness on the phrase a bit too OCD, on the other I must agree that a very simple re-phrasing like "appears {non-deterministic|random|unpredictable} to an outside observer" would've maybe even added value even for less technically-inclined folks, so yeah.
That non-deterministic claim, along with the rather ludicrous claim that this is all just some accidental self-awareness of the model or something (rather than Elon clearly and obviously sticking his fat fingers into the machine), makes the linked piece technically dubious.
A baked LLM is 100% deterministic. It is a straightforward set of matrix algebra with a perfectly deterministic output at a base state. There is no magic quantum mystery machine happening in the model. We add randomization -- the seed or temperature -- as a value-add, to randomize the outputs with the intention of giving creativity. So while it might be true that "in the customer-facing default state an LLM gives non-deterministic output", this is not some base truth about LLMs.
LLMs work using huge amounts of matrix multiplication.
Floating point multiplication is non-associative:
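A minimal check with plain Python floats (illustrative values):

    a, b, c = 0.1, 0.2, 0.3
    print(a * (b * c))                 # 0.006
    print((a * b) * c)                 # 0.006000000000000001
    print(a * (b * c) == (a * b) * c)  # False: the grouping changes the rounded result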
Almost all serious LLMs are deployed across multiple GPUs and have operations executed in batches for efficiency.
As such, the order in which those multiplications are run depends on all sorts of factors. There are no guarantees of operation order, which means non-associative floating point operations play a role in the final result.
This means that, in practice, most deployed LLMs are non-deterministic even with a fixed seed.
That's why vendors don't offer seed parameters accompanied by a promise that it will result in deterministic results - because that's a promise they cannot keep.
Here's an example: https://cookbook.openai.com/examples/reproducible_outputs_wi...
> Developers can now specify seed parameter in the Chat Completion request to receive (mostly) consistent outputs. [...] There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.
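Using that (best-effort) seed looks something like this sketch against the OpenAI Python SDK (model name illustrative; as the quote says, outputs are only mostly consistent):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Say hello in five words."}],
        temperature=0,
        seed=42,              # best-effort reproducibility, not a guarantee
    )
    # If system_fingerprint changes between calls, the backend configuration changed
    # and outputs may differ even with the same seed.
    print(resp.system_fingerprint, resp.choices[0].message.content)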
> That's why vendors don't offer seed parameters accompanied by a promise that it will result in deterministic results - because that's a promise they cannot keep.
They absolutely can keep such a promise, which anyone who has worked with LLMs could confirm. I can run a sequence of tokens through a large LLM thousands of times and get identical results every time (and have done precisely this! In fact, in one situation it was a QA test I built). I could run it millions of times and get exactly the same final layer every single time.
They don't want to keep such a promise because it limits flexibility and optimizations available when doing things at a very large scale. This is not an LLM thing, and saying "LLMs are non-deterministic" is simply wrong, even if you can find an LLM purveyor who decided to make choices where they no longer have any interest in such an outcome. And FWIW, non-associative floating point arithmetic is usually not the reason.
It's like claiming that a chef cannot do something that McDonalds and Burger King don't do, using those purveyors as an example of what is possible when cooking. Nothing works like that.
> Barring rare(?) GPU race conditions, LLMs produce the same output given the same inputs.
Are these LLMs in the room with us?
Not a single LLM available as a SaaS is deterministic.
As for other models: I've only run ollama locally, and it, too, provided different answers for the same question five minutes apart.
Edit/update: the output of not a single LLM available as a SaaS is deterministic, especially when used from a UI. Pointing out that you could probably run a tightly controlled model in a tightly controlled environment to achieve deterministic output is entirely irrelevant when describing the output of grok in situations where the user has no control over it.
The models themselves are mathematically deterministic. We add randomness during the sampling phase, which you can turn off when running the models locally.
The SaaS APIs are sometimes nondeterministic due to caching strategies and load balancing between experts on MoE models. However, if you took that model and executed it in single user environment, it could also be done deterministically.
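Concretely, "turning it off" locally is just greedy decoding, e.g. with Hugging Face transformers (a sketch; the model name is illustrative and deliberately small):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # illustrative; any causal LM works the same way
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tok("The capital of France is", return_tensors="pt")
    out = model.generate(**inputs, do_sample=False, max_new_tokens=10)  # greedy: no sampling randomness
    print(tok.decode(out[0]))  # same continuation on every run on the same software/hardware stack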
> However, if you took that model and executed it in single user environment,
Again, are those environments in the room with us?
In the context of the article, is the model executed in such an environment? Do we even know anything about the environment, randomness, sampling and anything in between or have any control over it (see e.g https://news.ycombinator.com/item?id=44528930)?
> Not a single LLM available as a SaaS is deterministic.
Gemini Flash has deterministic outputs, assuming you're referring to temperature 0 (obviously). Gemini Pro seems to be deterministic within the same kernel (?) but is likely switching between a few different kernels back and forth, depending on the batch or some other internal grouping.
And is the author of the original article running Gemini Flash/Gemini Pro through an API where he can control the temperature? Can kernels be controlled by the user? Can any of those be controlled through the UI/APIs from which most of these LLMs are invoked?
> but is likely switching between a few different kernels back and forth, depending on the batch or some other internal grouping.
So you're literally saying it's non-deterministic
> Not a single LLM available as a SaaS is deterministic.
Lower the temperature parameter.
It's not enough. I've done this and still often gotten different results for the same question.
So, how does one do it outside of APIs in the context we're discussing? In the UI or when invoking @grok in X?
How do we also turn off all the intermediate layers in between that we don't know about like "always rant about white genocide in South Africa" or "crash when user mentions David Meyer"?
Akchally... Strictly speaking, and to the best of my understanding, LLMs are deterministic in the sense that a dice roll is deterministic; the randomness comes from insufficient knowledge about their internal state. But use a constant seed and run the model with the same sequence of questions, and you will get the same answers. It's possible that interactions with other users who use the model in parallel could influence the outcome, but given that the state-of-the-art technique to provide memory and context is to re-submit the entirety of the current chat, I doubt that. One hint that what I surmise is in fact true can be gleaned from those text-to-image generators that allow seeds to be set: you still don't get a 'linear', predictable (but hopefully somewhat sensible) relation between prompt and output, but each (seed, prompt) pair will always give the same sequence of images.
True.
I'm now wondering, would it be desirable to have deterministic outputs on an LLM?