Comment by michalsustr
17 days ago
I recently attended a STAC conference where they claimed the GPUs themselves are not deterministic. The hand-wavy speculation is they need to temperature control the cores and the flop ops may be reordered during that process. (By temperature I mean physical temperature, not some nn sampling parameter). On such large scale of computation these small differences can show up in the actually different tokens.
I can assure you this isn't true. Having worked with GPUs for many years in an application where consist results are important it's not only possible but actually quite easy to ensure consistent inputs produce consistent results. The temperature and clock speed do not affect the order of operations, only the speed, and this doesn't affect the results. This is the same as with any modern CPU which will also adjust clock for temperature.