
Comment by pcwelder

2 months ago

> other prompts yours get batched with

Why would batching lead to variance?

> Why would batching lead to variance?

Depending on the shape of the data, a slightly different kernel implementation (e.g. for matrix multiplication) will be optimal, and different kernels can give slightly different results. There could also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic because they use tricks to go faster).
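
A minimal illustration of why kernel choice matters (plain NumPy, nothing model-specific): floating-point addition is not associative, so two kernels that reduce the same numbers in a different order need not agree bit-for-bit.

```python
import numpy as np

# Floating-point addition is not associative, so kernels that reduce
# in a different order (e.g. because the batch shape selected a
# different implementation) need not agree bit-for-bit.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

sequential = np.float32(0.0)
for v in x:                 # naive left-to-right reduction
    sequential += v

pairwise = x.sum()          # NumPy's pairwise (tree) reduction

# The two sums typically differ in the low bits.
print(sequential, pairwise, sequential == pairwise)
```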

  • Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.

    SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community-wide effort to track down the sources of non-determinism.

    [0] https://docs.sglang.ai/references/faq.html

  • > not entirely deterministic

    There's a Nobel prize waiting for you if that's the case. I'll assume you meant theoretically consistent or accurate.

  • Some of the non-determinism mentioned above manifests as sensitivity to _where_ data falls within a batch (see the sketch after this sub-thread).

    • In my experience with other regular models, once the context starts to fill up, quality starts to degrade.

      Wouldn't getting batched at the end of a batch have a similar -effect- on the results, where your prompt might receive overall less attention focused into it, if the context window is almost full?

      Idk just going by the vibes
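
A rough way to probe that position sensitivity (a sketch in PyTorch, not tied to any particular serving stack): feed the same row through a layer alone and embedded at different positions in a larger batch, then compare outputs bit-for-bit.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(512, 512)
row = torch.randn(1, 512)
others = torch.randn(7, 512)

alone = layer(row)                             # batch of 1
first = layer(torch.cat([row, others]))[:1]    # same row, batch of 8, position 0
last = layer(torch.cat([others, row]))[7:]     # same row, batch of 8, position 7

# Mathematically these are identical; whether they match bit-for-bit
# depends on which matmul kernel the backend dispatches for each shape,
# so on some hardware/backends one or both comparisons come out False.
print(torch.equal(alone, first), torch.equal(first, last))
```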

Attention doesn't get batched across users, and the runtime of attention for a given user's token depends on that user's total context length. Hence even in the ideal scenario where you get a dedicated attention-calculating GPU, the MLP-calculating GPU doing the batching will have to wait for the slowest user.

In the worst-case scenario, you are sharing a single attention-calculating GPU with someone who has a super long context window; that user will hog most of the GPU's memory bandwidth, even though you are both generating the same number of tokens.

This means that in the distributed setting, you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts.
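
A back-of-the-envelope sketch of that argument (the cost model and dimensions here are simplified assumptions for illustration): during decode, each new token attends to its user's whole cached context, so per-step attention work grows with context length and a batch's step is gated by its longest-context member.

```python
# Simplified cost model: per generated token, attention does roughly one
# dot product against every cached key/value pair in that user's context.
def attention_cost(context_len, d_model=4096):
    return context_len * d_model

def batch_step_cost(context_lens):
    # every request in the batch waits for the slowest (longest) context
    return max(attention_cost(c) for c in context_lens)

mixed = [512, 2_048, 128_000]       # one long-context user in the batch
short_only = [512, 1_024, 2_048]

print(batch_step_cost(mixed))       # dominated by the 128k-context user
print(batch_step_cost(short_only))  # roughly 60x less work per step
```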

Batching can lead to variance with things like batch norm, but most transformers use layer norm, which avoids this problem.
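
A quick way to see that difference in PyTorch (training-mode BatchNorm, so batch statistics are used):

```python
import torch

torch.manual_seed(0)
sample = torch.randn(1, 8)
mates_a = torch.randn(3, 8)   # one set of batch-mates
mates_b = torch.randn(3, 8)   # a different set of batch-mates

bn = torch.nn.BatchNorm1d(8).train()  # normalizes across the batch dimension
ln = torch.nn.LayerNorm(8)            # normalizes each sample independently

bn_a = bn(torch.cat([sample, mates_a]))[0]
bn_b = bn(torch.cat([sample, mates_b]))[0]
ln_a = ln(torch.cat([sample, mates_a]))[0]
ln_b = ln(torch.cat([sample, mates_b]))[0]

print(torch.allclose(bn_a, bn_b))  # False: batch norm output depends on batch-mates
print(torch.allclose(ln_a, ln_b))  # True: layer norm does not
```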

Because these models are context-sensitive. Every token can influence the output.

In some mixture-of-experts approaches, samples or tokens are distributed among experts, and the experts are selected by predicting which expert is a good match for each sample. Depending on your neighbors in the batch, you might be assigned different experts.
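
A toy sketch of one common mechanism behind this, capacity-limited routing (the router, capacity, and dimensions here are made up for illustration, not any particular MoE implementation): when a token's preferred expert is already full of its batch-mates, it gets bumped to a lower-ranked expert.

```python
import torch

torch.manual_seed(0)
num_experts, d = 4, 16
router = torch.nn.Linear(d, num_experts)  # toy router (hypothetical)
capacity = 2                              # each expert takes at most 2 tokens per batch

def route(tokens):
    scores = router(tokens)                       # (n_tokens, num_experts)
    ranked = scores.argsort(-1, descending=True)  # each token's expert preferences
    load = [0] * num_experts
    assigned = []
    for prefs in ranked:                          # greedy, in batch order
        for e in prefs.tolist():
            if load[e] < capacity:
                load[e] += 1
                assigned.append(e)
                break
    return assigned

token = torch.randn(1, d)
lookalikes = token + 0.01 * torch.randn(5, d)     # batch-mates that prefer the same expert

alone = route(token)[0]
crowded = route(torch.cat([lookalikes, token]))[-1]

# Routed alone, the token gets its top expert; surrounded by batch-mates that
# fill that expert first, it typically gets bumped to a lower-ranked choice.
print(alone, crowded)
```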