Comment by stingraycharles
3 days ago
It means the temperature should be set to 0 (which not every provider supports) so that the output becomes entirely deterministic. Right now, with most models, if you give the same input prompt twice you'll get two different solutions.
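For illustration, a minimal sketch of what "temperature 0" means at the API level, using the OpenAI Python SDK (the model name and prompt are just placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(prompt: str) -> str:
        # temperature=0 requests greedy decoding, but (as noted in the replies)
        # that alone doesn't guarantee bit-identical output across runs
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

    a = ask("Write a function that reverses a string.")
    b = ask("Write a function that reverses a string.")
    print(a == b)  # often True, but not guaranteed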
Even at temp 0, you might get different answers, depending on your inference engine. There might be hardware differences, as well as software issues (e.g. vLLM documents this: with batching, results can differ depending on where in the batch sequence your query landed).
Production inference is not deterministic because of sharding (i.e. parameter weights split across several GPUs on the same machine, or across experts in an MoE), timing-based kernel choices (e.g. torch.backends.cudnn.benchmark), or batched routing in MoEs. Probably best to host a small model yourself.
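If you do self-host, a rough sketch with vLLM's offline API (model name is a placeholder; single GPU, greedy decoding, fixed seed):

    from vllm import LLM, SamplingParams

    # Single-GPU offline inference keeps batching and kernel selection under
    # your control, which removes most (though not all) run-to-run variation.
    llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder small model

    params = SamplingParams(
        temperature=0.0,  # greedy decoding
        seed=1234,        # fixed seed for any remaining sampling
        max_tokens=256,
    )

    outputs = llm.generate(["Write a function that reverses a string."], params)
    print(outputs[0].outputs[0].text)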
Claude Code already uses a temperature of 0 (just inspect the requests) but it's not deterministic
Not to mention it also performs web searches, web fetching, etc., which would make it non-deterministic as well.
Two years ago when I was working on this at a startup, setting OAI models’ temp to 0 still didn’t make them deterministic. Has that changed?
Do LLM inference engines have a way to seed their randomness, so as to have reproducible outputs while still allowing some variance if desired?
Yes, although LLM providers don't always expose it to end users.
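Where it is exposed, it looks something like this with the OpenAI SDK: seed is a best-effort reproducibility hint, and the response's system_fingerprint tells you whether the backend configuration changed between calls (model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Pick a random city."}],
        temperature=0.7,  # keep some variance...
        seed=42,          # ...but make it reproducible on a best-effort basis
    )
    print(resp.system_fingerprint, resp.choices[0].message.content)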
This is good: run it n times, have the model review them and pick the best one.
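A rough sketch of that best-of-n pattern, assuming an OpenAI-style API (the model name, prompt, and reviewer instructions are placeholders):

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"  # placeholder model name

    def generate(prompt: str, n: int = 3) -> list[str]:
        # n independent samples at a non-zero temperature
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,
            n=n,
        )
        return [c.message.content for c in resp.choices]

    def pick_best(prompt: str, candidates: list[str]) -> str:
        numbered = "\n\n".join(
            f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
        )
        review = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": f"Task:\n{prompt}\n\n{numbered}\n\n"
                           "Reply with only the number of the best candidate.",
            }],
            temperature=0,  # greedy decoding for the review step
        )
        try:
            choice = int(review.choices[0].message.content.strip()) - 1
        except ValueError:
            choice = 0  # fall back to the first candidate if the reply isn't a bare number
        return candidates[max(0, min(choice, len(candidates) - 1))]

    prompt = "Write a Python function that reverses a string."
    print(pick_best(prompt, generate(prompt)))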
I would only care about more deterministic output if I was repeating the same process with the same model, which is not the point of the exercise.