
Comment by csomar

4 days ago

The models are deterministic, the inference is not.

Which is a useless distinction. When we say "models" in this context, we mean the whole LLM plus the infrastructure that serves it (including caches, etc.).

What does that even mean?

Even then, depending on the specific implementation, floating-point associativity can be an issue: results can differ across batch sizes, across KV-cache implementations, and so on.
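The non-associativity itself is easy to demonstrate even on a CPU; a minimal Python sketch (the values are illustrative, not taken from any model):

```python
# Floating-point addition is not associative: regrouping the same
# three values changes the rounded result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # the 1.0 is absorbed by -1e16's rounding -> 0.0
print(left, right)   # 1.0 0.0
```

If the hardware scheduler changes which grouping actually happens from run to run, the output bits change with it.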

  • That's still an inference-time issue. With perfect inference at zero temperature, the models are deterministic. There is no intrinsic randomness in software-only computing.

    • Floating point associativity differences can lead to non-determinism at 0 temperature if the order of operations is non-deterministic.

      Anyone with reasonable GPU-computation experience who pays attention knows that even randomness in warp completion times can easily lead to non-determinism due to associativity differences.

      For instance: https://www.twosigma.com/articles/a-workaround-for-non-deter...

      It is well known among practitioners that CUDA isn't strongly deterministic due to these factors.

      Differences in inference batch sizes compound these issues.

      Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order in which items reach the reduce step (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.
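The map-reduce point can be sketched in a few lines of Python; `sequential_sum` and `tree_sum` here are illustrative stand-ins for two reduction schedules, not real CUDA kernels:

```python
def sequential_sum(xs):
    # Left-to-right reduction: the order a single-threaded loop uses.
    total = 0.0
    for x in xs:
        total += x
    return total

def tree_sum(xs):
    # Pairwise (tree) reduction: the shape a parallel reduce tends to take.
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])

# The per-element ("map") values are identical; only the reduction
# order differs, yet the two totals disagree.
xs = [1e16, 1.0, 1.0]
print(sequential_sum(xs))  # 1e16       (each 1.0 is absorbed in turn)
print(tree_sum(xs))        # 1e16 + 2.0 (the 1.0s combine first)
```

On a GPU, which of these shapes you effectively get, and in what order the partial sums arrive, can vary from run to run, which is exactly where the non-determinism leaks in.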
