Comment by pigscantfly
17 days ago
The samples your input is batched with on the provider's backend vary between calls, and sparse mixture-of-experts routing, when implemented for efficient utilization, induces competition among tokens: expert usage is either encouraged or enforced to be balanced across the tokens in the same fixed-size group. I think it's unknown, or at least undisclosed, exactly why sequence non-determinism at zero temperature occurs in these proprietary implementations, but this is a good theory.
[1] https://arxiv.org/abs/2308.00951, pg. 4
[2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
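To make the batch dependence concrete, here is a toy sketch of capacity-limited top-1 routing. It is not any provider's actual implementation; the function name, the number of experts, the capacity of 2, the priority-by-router-score rule, and the drop-on-overflow behavior are all illustrative assumptions. The point is only that the expert assigned to the "same" token can change depending on which other tokens happen to share its batch:

```python
import numpy as np

def route_with_capacity(router_logits, capacity):
    """Greedy top-1 routing: each token wants its argmax expert, but each
    expert accepts at most `capacity` tokens per batch. Tokens that overflow
    are simply marked as dropped (-1) in this toy version."""
    n_tokens, n_experts = router_logits.shape
    choice = router_logits.argmax(axis=1)            # each token's preferred expert
    load = np.zeros(n_experts, dtype=int)
    assignment = np.full(n_tokens, -1)               # -1 = overflowed / dropped
    # Process tokens in order of descending router score (an assumed priority rule).
    for t in np.argsort(-router_logits.max(axis=1)):
        e = choice[t]
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
    return assignment

rng = np.random.default_rng(0)
my_token = rng.normal(size=(1, 4))                   # the "same" token in both calls

# Two calls where the only difference is the other tokens in the batch.
batch_a = np.vstack([my_token, rng.normal(size=(7, 4))])
batch_b = np.vstack([my_token, rng.normal(size=(7, 4))])

print(route_with_capacity(batch_a, capacity=2)[0])   # expert assigned in batch A
print(route_with_capacity(batch_b, capacity=2)[0])   # may differ in batch B
```

Real systems typically use top-2 routing with auxiliary load-balancing losses rather than this hard greedy scheme, but the batch dependence is the same in spirit: the routing outcome for one token is a function of the whole group it is scheduled with.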
I thought the temperature only affects randomness at the end of the network (when turning the final embeddings back into words via the softmax). It cannot influence routing, which is inherently influenced by which examples get batched together (i.e., it might depend on other users of the system).
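A minimal sketch of where temperature enters, with made-up logit values: at temperature 0, sampling collapses to argmax, which is deterministic given the logits. But if batch-dependent computation upstream (routing, batch-variant kernels) perturbs the logits even slightly, a near-tie can flip and the output token changes with no randomness in the sampler at all.

```python
import numpy as np

def sample(logits, temperature, rng):
    """Temperature only touches this final step. The convention that
    temperature == 0 means greedy decoding is assumed here."""
    if temperature == 0.0:
        return int(np.argmax(logits))                # greedy: deterministic given logits
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.00, 1.99, -1.0])                # hypothetical near-tied logits

print(sample(logits, 0.0, rng))                                  # always token 0
print(sample(logits + np.array([0.0, 0.02, 0.0]), 0.0, rng))     # a tiny upstream shift
                                                                 # flips the argmax to token 1
```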