Comment by rightbyte

10 hours ago

How can there be noise in floating-point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c gives a different result from a+c+b, etc.
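
A minimal Python sketch of that reordering effect, assuming IEEE 754 double precision (which is what CPython floats are on mainstream platforms). The values are picked purely for illustration, so that one ordering loses the small term entirely:

```python
# Illustrative values: at magnitude 1e16 adjacent doubles are 2 apart,
# so adding 1.0 to 1e16 is rounded away.
a, b, c = 1e16, -1e16, 1.0

sum_abc = (a + b) + c  # a and b cancel first, then c survives
sum_acb = (a + c) + b  # c is absorbed into a before the cancellation

print(sum_abc, sum_acb)  # 1.0 0.0
```

Real workloads rarely differ this dramatically; the usual symptom is a mismatch in the last bit or two that then propagates through later operations.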

7 comments

hansvm 37 minutes ago

Batching order, as you mentioned, matters a lot, and for any heavily optimized kernels it will change from one machine to the next. You also have the choice of backend numerical library, which can differ between, e.g., OS versions. There are floating-point bugs from time to time, especially in GPUs. Many operations (like transcendentals) are usually given a couple of bits of wiggle room in the result. Another program executing could have changed the floating-point rounding mode on one device. More aggressive ML optimizers might automatically apply various forms of reduced precision to the requested high-level operation. If you have enough optimizations enabled, the compiler may or may not emit fused instructions like fmadd, so any one build of your library is deterministic (excluding the other ideas mentioned above), but different machines with different builds (because of a staged rollout, different architectures, engineering mistakes, etc.) can have different outputs. And so on.
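
To illustrate just the fmadd point: a fused multiply-add rounds once, after the exact product and sum, while separate multiply and add instructions round twice. The sketch below emulates the fused result with exact rational arithmetic rather than a hardware instruction; `fma_emulated` is a hypothetical helper and the inputs are chosen so the two paths visibly disagree.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Hypothetical helper: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c            # product rounds to 1.0 first, so the sum is 0.0
fused = fma_emulated(a, b, c)  # keeps the exact product 1 - 2**-60

print(unfused, fused)
```

The same source expression therefore yields 0.0 on a build that emits separate instructions and a tiny negative number on a build that fuses them.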

StilesCrisis 6 hours ago

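IEEE 754 doesn't mandate exact results for functions like exp(x). The standard only recommends correctly rounded versions; in practice the guarantee is usually something like "within 2 ULP of the true answer," set by the language or library spec. Hardware vendors are free to implement these functions in any way that meets the error tolerance.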
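
To make a ULP tolerance concrete, here is a rough Python sketch (3.9+ for `math.nextafter`). `ulp_distance` is a hypothetical helper, and `other` is only a stand-in for what a second vendor's exp() might return, not a measured value:

```python
import math
import struct


def ulp_distance(x: float, y: float) -> int:
    """Hypothetical helper: number of representable-double steps from x to y.

    Assumes both are positive and finite, where the bit patterns of IEEE 754
    doubles sort in the same order as the values.
    """
    xi = struct.unpack("<q", struct.pack("<d", x))[0]
    yi = struct.unpack("<q", struct.pack("<d", y))[0]
    return abs(xi - yi)


reference = math.exp(1.0)
# Stand-in for another implementation's result: the next double up.
other = math.nextafter(reference, math.inf)

print(ulp_distance(reference, other))  # 1
print(reference == other)              # False
```

Both values would pass a 2 ULP tolerance, yet they are not bit-identical, which is all it takes for downstream results to diverge.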

toolslive 8 hours ago

While the IEEE 754 standard ensures that individual basic operations are deterministic and strictly bounded, it does not guarantee that an entire program will yield bit-identical results on all CPUs.

CPUs and their execution environments introduce subtle hardware variations, architecture choices, and compiler optimizations that break bit-level consistency.

(same for GPU/TPU, ...)
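
One way an architecture choice can show up, sketched in Python: a vectorized reduction keeps one accumulator per SIMD lane, so the lane count changes which partial sums get rounded. `lane_sum` is a hypothetical model of such a reduction, not any particular compiler's output, and the input is contrived to make the difference obvious.

```python
def sequential_sum(values):
    """Add left to right with a single accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes):
    """Hypothetical model of a SIMD reduction with `lanes` accumulators."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)


values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))  # 1.0
print(lane_sum(values, 2))     # 2.0
```

Every individual addition here is correctly rounded; only the grouping differs between the two code paths.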

vlovich123 7 hours ago

Parent is correct - the math is very deterministic if you can guarantee it’s running repeatedly on the same machine and you’re not processing “random” requests in parallel. The compiler is irrelevant because once the code is generated it’s not getting recompiled and thus isn’t a source of nondeterminism (and generally, if you don’t touch the math, the compiler will consistently emit the same underlying machine code).
- simiones 6 hours ago

  This sub-thread was about cloud environments, where different requests may be served by different hardware. In fact, any particular LLM cloud is, for now, very likely to run a mix of hardware from different vendors.

throwaway173738 7 hours ago

The hardware is, after all, running a fundamentally voltage-based process. The logical “no-man’s land” between valid voltage levels is chosen to limit the likelihood of a weak component producing faulty logic, but it’s impractical to run through the set of all possible starting states and verify that, after an unbounded number of clock steps, the machine reaches a predictable end state on every device being manufactured.