Comment by an0malous

3 days ago

Aren’t transformers intrinsically deterministic? I thought the randomness was intentional to make chatbots seem more natural, and OpenAI used to have a seed parameter you could set for deterministic output. I don’t know why that feature isn’t more popular, for the reasons this article outlines

(I'm not an expert. I'd love to be corrected by someone who actually knows.)

Floating-point arithmetic is not associative: (A + B) + C does not necessarily equal A + (B + C), and you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first. So, in theory, transformers can be deterministic, but in a real system they almost always aren't.
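
A minimal illustration in plain Python (no GPU involved; the values are arbitrary, chosen only to show the effect):

    # Floating-point addition is not associative: grouping changes the rounding.
    a, b, c = 0.1, 0.2, 0.3

    left = (a + b) + c    # 0.6000000000000001
    right = a + (b + c)   # 0.6

    print(left == right)  # False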

  • > you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first

    Technically possible, but I think unlikely to happen in practice.

    At the higher level, these large models are sequential and there's nothing to parallelize across steps: inference is a continuous chain of data dependencies between temporary tensors, which makes it impossible to compute different steps in parallel.

    At the lower level, each step is a computationally expensive operation on a large tensor or matrix. These tensors often hold millions of numbers, the problem is highly parallelizable, and the tactics for doing that efficiently are well researched, because matrix linear algebra has been in wide use for decades. However, fine-grained parallelism like "adding together whichever two finish first" is both complicated and slow to implement on modern GPUs: with many thousands of active threads, the synchronization is just too expensive. Instead, operations like matrix multiplication typically assign one thread per output element (or per fixed count of output elements), and reductions like softmax or vector dot products use a series of exponentially decreasing reduction steps, i.e. the order is deterministic (a rough sketch of that reduction follows at the end of this comment).

    However, that order may change with even a minor update to any part of the software stack, including opaque low-level pieces like GPU drivers and firmware. Library developers keep updating GPU kernels, while drivers, firmware and the OS kernel collectively implement the scheduler that assigns work to cores; both the kernels and the scheduling can affect the order of these arithmetic operations.
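
    A rough sketch of that fixed-order reduction idea in Python (CPU-side and purely illustrative; real GPU kernels do this with thread blocks and shared memory):

        def tree_sum(values):
            # Pairwise reduction in exponentially decreasing steps. The pairing is
            # fixed by index, so the summation order (and hence the rounding) is
            # identical on every run with the same input.
            vals = list(values)
            while len(vals) > 1:
                if len(vals) % 2:
                    vals.append(0.0)  # pad odd-length levels
                vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
            return vals[0] if vals else 0.0

        print(tree_sum([0.1] * 10))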

  • Not an expert either, but my understanding is that large models use quantized weights and tensor inputs for inference. Multiplication and addition of fixed-point values are associative, so unless there's an intermediate "convert to/from IEEE float" step (activation functions, maybe?), you can still build determinism into a performant model.

    • Fixed-point arithmetic isn't truly associative unless you have infinite precision. The second you hit a limit or saturate/clamp a value, the result very much depends on the order of operations.
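
      A toy illustration with int8-style saturating addition (the helper and values are made up for the example):

          def sat_add(a, b, lo=-128, hi=127):
              # Saturating add: the result is clamped to the representable range.
              return max(lo, min(hi, a + b))

          a, b, c = 100, 100, -100
          print(sat_add(sat_add(a, b), c))  # (100 + 100) saturates to 127, then 127 - 100 = 27
          print(sat_add(a, sat_add(b, c)))  # 100 - 100 = 0, then 100 + 0 = 100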

Well, you could say that about computers in general. I'm assuming you're referring to temperature (or something similar), which can be set to always pick the most probable token. Floats aside, that should be deterministic. But in practice I don't think it changes much, since adjusting the input slightly can lead to very different output. Also, back in the day, the temperature helped the model avoid cyclic loops.

  • Yes, but chaotic is very different from non-deterministic, and not just in an academic way: I can, for example, write tests against chaotic outputs, but not really against non-deterministic outputs.
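
    A toy example of that distinction (nothing LLM-specific): the logistic map is chaotic but fully deterministic, so an exact-equality test passes on every run, which wouldn't hold if the computation itself were non-deterministic.

        def logistic_map(x, steps, r=3.9):
            # Deterministic but chaotic: tiny changes to x diverge quickly,
            # yet the same x always produces a bit-identical result.
            for _ in range(steps):
                x = r * x * (1.0 - x)
            return x

        assert logistic_map(0.2, 100) == logistic_map(0.2, 100)  # a test you can write
        print(abs(logistic_map(0.2, 100) - logistic_map(0.2000001, 100)))  # large divergence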

Determinism of LLMs has often been discussed on HN, for example here:

https://news.ycombinator.com/item?id=45200925

The TL;DR is that LLMs are often not deterministic because GPUs compute submatrices in parallel and sum them up in different orders, depending on which finish first. This is maybe a few percent faster than always using the same order, but it absolutely could be made deterministic if people cared enough; CUDA even provides deterministic primitives if desired. You also have to use the same random seed for the sampler, of course, but that part is trivial.
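
You can see the order-of-summation effect without a GPU: summing the same float32 numbers in a different order typically gives a slightly different result (numpy is used here only for the float32 dtype; the data is random and purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    a = np.sum(x)                   # one summation order
    b = np.sum(rng.permutation(x))  # the same numbers, different order
    print(a, b, a == b)             # typically not bit-identical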

Strictly deterministic output for a given prompt prevents the use of RAG, which increasingly limits the relative utility of an LLM within an organization.

  • How so? RAG is just a mechanism for querying external data sources. I don't see any need for non-determinism there.

The models generate a distribution over tokens. Which token to pick is a choice: one can sample from the distribution, hence the randomness.
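
A minimal sketch of that choice with made-up logits (greedy argmax is deterministic; sampling is where the intentional randomness enters, and seeding the generator makes even the sampling reproducible):

    import numpy as np

    logits = np.array([2.0, 1.0, 0.5, 0.1])  # made-up scores for 4 tokens
    temperature = 0.8

    probs = np.exp(logits / temperature)
    probs /= probs.sum()

    greedy = int(np.argmax(probs))                  # always the same token
    rng = np.random.default_rng()                   # pass a seed here for reproducible sampling
    sampled = int(rng.choice(len(probs), p=probs))  # varies from run to run

    print(greedy, sampled, probs.round(3))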