Comment by jmalicki

5 days ago

Floating-point associativity differences can lead to non-determinism even at temperature 0 if the order of operations is non-deterministic.
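To make the associativity point concrete, here is a minimal sketch in plain Python (standard IEEE 754 doubles, no GPU involved). The variable names are just for illustration; the only thing it shows is that regrouping the same three additions changes the result.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # round after a + b, then add c
right = a + (b + c)  # round after b + c, then add a

print(left == right)  # False
print(repr(left), repr(right))
```

The difference is a single unit in the last place, but in a model it can flip an argmax between two near-tied logits, which is enough to change the sampled token even with greedy decoding.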

Anyone with reasonable experience with GPU computation who pays attention knows that even randomness in warp completion times can easily lead to non-determinism due to associativity differences.

For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...

It is very well known among practitioners that CUDA isn't strongly deterministic due to these factors.

Differences in inference batch size compound these issues.

Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order in which items are sent to the reduce steps (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.
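A rough CPU-side sketch of the reduce-order effect, in plain Python. This is not how CUDA schedules anything; `sequential_sum`, `tree_sum`, and the input values are hypothetical and chosen so the rounding differences are large enough to see. The mapped values are identical in every case, and only the order or shape of the reduction changes.

```python
from functools import reduce

def sequential_sum(values):
    """Left-to-right reduction: ((a + b) + c) + d."""
    return reduce(lambda acc, x: acc + x, values, 0.0)

def tree_sum(values):
    """Pairwise tree reduction: (a + b) + (c + d)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])

# Output of a deterministic "map" step (illustrative values).
mapped = [1e16, 1.0, -1e16, 1.0]

# The same items arriving at the reducer in a different order,
# e.g. because workers finished at different times.
reordered = [1e16, -1e16, 1.0, 1.0]

print(sequential_sum(mapped))     # 1.0
print(tree_sum(mapped))           # 0.0
print(sequential_sum(reordered))  # 2.0
```

Three reductions over the same four numbers give three different answers, because each 1.0 is either absorbed or preserved depending on what it gets added to first. Real workloads show far smaller discrepancies, but the mechanism is the same.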