Comment by csomar
5 days ago
That's still an inference time issue. If you have perfect inference with a zero temperature, the models are deterministic. There is no intrinsic randomness in software-only computing.
Floating-point associativity differences can lead to non-determinism even at zero temperature if the order of operations is non-deterministic.
Anyone with reasonable experience in GPU computation who pays attention knows that even randomness in warp completion times can easily lead to non-determinism due to associativity differences.
For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...
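The underlying issue is easy to reproduce on a plain CPU, no GPU required: IEEE-754 floating-point addition is not associative, so any change in evaluation order can change the rounded result. A minimal Python illustration:

```python
# IEEE-754 doubles are not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds to 0.30000000000000004
right = a + (b + c)  # 0.2 + 0.3 rounds to exactly 0.5

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

The same inputs, grouped differently, disagree in the last bit, which is exactly what a reduction with a non-fixed order produces.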
It is well known among practitioners that CUDA isn't strongly deterministic because of these factors.
Differences in inference batch sizes compound these issues.
Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order that items are sent to the reduce steps (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.
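That reduce-order sensitivity can be simulated without a GPU. A toy Python sketch (the magnitudes are deliberately contrived so the effect is visible in a single addition; real kernels show much smaller, but still nonzero, drift):

```python
# Simulate a reduction whose element order is not fixed, as in a GPU
# tree-reduce where warp completion order varies between runs.
vals = [1e16, 1.0, -1e16]  # contrived magnitudes to make absorption obvious

# Order 1: the 1.0 is absorbed into 1e16 (it is below one ulp at that scale),
# then the large terms cancel.
order1 = (vals[0] + vals[1]) + vals[2]   # -> 0.0

# Order 2: the large terms cancel first, so the 1.0 survives.
order2 = (vals[0] + vals[2]) + vals[1]   # -> 1.0

print(order1, order2)  # same inputs, different sums
```

Each map step here is perfectly deterministic; only the order items reach the reduce changes, and that alone changes the answer.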
My point is, your inference process is the non-deterministic part, not the model itself.
Eh, if you have a PyTorch model that uses non-deterministic tensor operations such as matrix multiplications, I think it is fair to call the model non-deterministic, since the matmul is not guaranteed to be deterministic. The non-determinism of a matmul isn't a bug but a feature.
See e.g. https://discuss.pytorch.org/t/why-is-torch-mm-non-determinis...