Comment by johndough

2 hours ago

Determinism of LLMs has often been discussed on HN, for example here:

https://news.ycombinator.com/item?id=45200925

The TL;DR is that LLMs are often not deterministic because GPUs compute submatrices in parallel and sum the partial results in whatever order they happen to finish. Floating-point addition is not associative, so a different summation order can give a slightly different result. The unordered sum is maybe a few percent faster than always using the same order, but it absolutely could be made deterministic if people cared enough. CUDA even provides deterministic primitives if desired. Of course the samplers also need the same random seed, but that is trivial.
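To make the summation-order point concrete, here is a toy sketch in plain Python. It is not GPU code. The `partials` values, the two "arrival orders" and the helper names `reduce_in_order` and `sample_token` are all made up for illustration; the values are chosen to exaggerate the rounding effect, which is much smaller in real models.

```python
import random


def reduce_in_order(partials, order):
    """Sum partial results left to right in the given index order."""
    total = 0.0
    for i in order:
        total += partials[i]
    return total


# Hypothetical partial sums, as if produced by four parallel work items.
partials = [1e16, 1.0, -1e16, 1.0]

# "Whoever finishes first" scheme: two simulated completion orders.
arrival_a = [0, 1, 2, 3]
arrival_b = [0, 2, 1, 3]
result_a = reduce_in_order(partials, arrival_a)
result_b = reduce_in_order(partials, arrival_b)

# Deterministic scheme: ignore arrival order and always reduce by index.
fixed_a = reduce_in_order(partials, sorted(arrival_a))
fixed_b = reduce_in_order(partials, sorted(arrival_b))


def sample_token(weights, seed):
    """Draw one token index from `weights` using a sampler seeded with `seed`."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


print(result_a, result_b)  # differ: same numbers, different order
print(fixed_a, fixed_b)    # identical: order is pinned
print(sample_token([0.1, 0.7, 0.2], seed=42) == sample_token([0.1, 0.7, 0.2], seed=42))
```

The fixed-order version gives up the freedom to accumulate whatever is ready first, which is where the small speed cost comes from.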