
Comment by lelandbatey

14 hours ago

As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap that out for something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).
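To make the seeding idea concrete, here is a minimal sketch of temperature sampling driven by an explicitly seeded PRNG. The function name `sample_tokens`, the toy logits, and the seed value are all made up for illustration; this is not how any particular inference engine implements sampling, just the general shape of it.

```python
import math
import random

def sample_tokens(logits, temperature, seed, n=10):
    """Draw n token indices from softmax(logits / temperature).

    Hypothetical helper: all randomness comes from one seeded PRNG.
    """
    rng = random.Random(seed)  # isolated generator, no global state
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=n)

logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not from a real model
run_a = sample_tokens(logits, temperature=0.8, seed=1234)
run_b = sample_tokens(logits, temperature=0.8, seed=1234)
print(run_a == run_b)  # True: same seed and same logits give the same draws
```

The catch discussed further down the thread is the "same logits" part: the sampler is reproducible only if the numbers fed into it are bit-identical between runs.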

But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just that none of the existing software (even llama.cpp AFAIK) chooses to support stable seeding, because differences across backends like CPU/Vulkan/CUDA/Metal make it difficult to get consistent results.

They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.

It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.

  • How can there be noise in floating point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c is different from a+c+b etc.

    • Batching order, as you mentioned, matters a lot, and for any heavily optimized kernels it will change from one machine to the next. You also have the choice of backend numerical library from, e.g., different OS versions. There are floating-point bugs from time to time, especially in GPUs. Many operations (like transcendentals) are usually given a couple bits of wiggle room in the result. Another program executing could have changed the floating-point rounding mode on one device. More aggressive ML optimizers might automatically apply various forms of reduced precision to the requested high-level operation. If you have enough optimizations enabled, you might non-deterministically get compiled instructions like fmadd so that any one build of your library is deterministic (excluding other ideas mentioned above) but different machines with different builds (because of a staged rollout, different architectures, engineering mistakes, etc) can have different outputs. And so on.

    • IEEE-754 doesn't mandate exact results for functions like exp(x); it only recommends correct rounding for them. Bounds like "within 2 ULP of the true answer" come from the library or platform documentation. Hardware vendors are free to implement these functions in any way that meets the stated error tolerance.

    • While the IEEE 754 standard ensures that individual basic operations are deterministic and strictly bounded, it does not guarantee that an entire program will yield bit-identical results on all CPUs.

      CPUs and their execution environments introduce subtle hardware variations, architecture choices, and compiler optimizations that break bit-level consistency.

      (same for GPU/TPU, ...)
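A small sketch of the point raised in the replies above: each individual addition is correctly rounded, yet regrouping the same three operands changes the final bits. The values are the usual textbook ones and are chosen only for illustration.

```python
import math

# Same three operands, two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# The two results are adjacent doubles, i.e. they differ by one ULP.
print(abs(left - right) == math.ulp(right))  # True
```

A one-ULP difference looks harmless, but if it lands on two nearly tied logits it can change which token wins at temperature 0, and everything generated afterwards diverges from there.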


Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.

  • They are only non-deterministic when you’re doing batching and a kernel ends up running across a “random” set of token streams. If you’re only processing one user’s request, they’re very much deterministic.
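As a rough CPU-side illustration of the reduction-order argument above, the sketch below sums the same four numbers in every possible order, standing in for threads that finish in a different order on each run. The values are contrived to exaggerate the effect, and `math.fsum` is used only as an example of an order-independent (but slower) reduction; it is not what GPU kernels actually do.

```python
import itertools
import math

# Contrived values: large terms that cancel, plus small ones that can get lost.
values = [1e16, 1.0, -1e16, 1.0]

def naive_sum(xs):
    """Left-to-right accumulation, like a reduction with a fixed arrival order."""
    total = 0.0
    for x in xs:
        total += x
    return total

orders = list(itertools.permutations(values))
naive_results = {naive_sum(p) for p in orders}
exact_results = {math.fsum(p) for p in orders}

print(sorted(naive_results))  # several different answers depending on order
print(sorted(exact_results))  # [2.0]: fsum is exactly rounded, so order-independent
```

This mirrors the trade-off in the comment: a reduction can be made deterministic, either by fixing the order or by using a more careful algorithm, but that usually costs throughput.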