
Comment by pcwelder

2 months ago

> other prompts yours get batched with

Why would batching lead to variance?

> Why would batching lead to variance?

Depending on the shape of the data, a slightly different kernel implementation (e.g. for matrix multiplication) will be optimal, and different kernels can give slightly different results. There could also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic because they use tricks to go faster).
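
A minimal illustration of why kernel choice matters (plain NumPy, nothing model-specific): floating-point addition is not associative, so two kernels that reduce the same numbers in a different order need not agree bit-for-bit.

```python
import numpy as np

# Floating-point addition is not associative, so kernels that reduce
# in a different order (e.g. because the batch shape selected a
# different implementation) need not agree bit-for-bit.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

sequential = np.float32(0.0)
for v in x:                 # naive left-to-right reduction
    sequential += v

pairwise = x.sum()          # NumPy's pairwise (tree) reduction

# The two sums typically differ in the low bits.
print(sequential, pairwise, sequential == pairwise)
```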

  • Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.

    SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community-wide effort to track down the sources of non-determinism.

    [0] https://docs.sglang.ai/references/faq.html

  • > not entirely deterministic

    There's a Nobel prize waiting for you if that's the case. I'll assume you meant theoretically consistent or accurate.

  • Some of the non-determinism mentioned above manifests as sensitivity to _where_ data falls within a batch (see the sketch after this sub-thread).

    • In my experience with other regular models, once the context starts to fill up, quality starts to degrade.

      Wouldn't getting batched at the end of a batch have a similar -effect- on the results, where your prompt might receive overall less attention focused into it, if the context window is almost full?

      Idk just going by the vibes
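
A rough way to probe that position sensitivity (a sketch in PyTorch, not tied to any particular serving stack): feed the same row through a layer alone and embedded at different positions in a larger batch, then compare outputs bit-for-bit.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(512, 512)
row = torch.randn(1, 512)
others = torch.randn(7, 512)

alone = layer(row)                             # batch of 1
first = layer(torch.cat([row, others]))[:1]    # same row, batch of 8, position 0
last = layer(torch.cat([others, row]))[7:]     # same row, batch of 8, position 7

# Mathematically these are identical; whether they match bit-for-bit
# depends on which matmul kernel the backend dispatches for each shape,
# so on some hardware/backends one or both comparisons come out False.
print(torch.equal(alone, first), torch.equal(first, last))
```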

Attention doesn't get batched across users, and the runtime of attention for a given user's token depends on that user's total context length. Hence even in the ideal scenario where you get a dedicated attention-calculating GPU, the MLP-calculating GPU doing the batching will have to wait for the slowest user.

In the worst-case scenario, you are sharing a single attention-calculating GPU with someone who has a super long context window; that user will hog most of the GPU's memory bandwidth, even though you are both generating the same number of tokens.

This means that in the distributed setting, you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts.
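
A back-of-the-envelope sketch of that argument (the cost model and dimensions here are simplified assumptions for illustration): during decode, each new token attends to its user's whole cached context, so per-step attention work grows with context length and a batch's step is gated by its longest-context member.

```python
# Simplified cost model: per generated token, attention does roughly one
# dot product against every cached key/value pair in that user's context.
def attention_cost(context_len, d_model=4096):
    return context_len * d_model

def batch_step_cost(context_lens):
    # every request in the batch waits for the slowest (longest) context
    return max(attention_cost(c) for c in context_lens)

mixed = [512, 2_048, 128_000]       # one long-context user in the batch
short_only = [512, 1_024, 2_048]

print(batch_step_cost(mixed))       # dominated by the 128k-context user
print(batch_step_cost(short_only))  # roughly 60x less work per step
```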

Batching can lead to variance with things like batch norm, but most transformers use layer norm, which avoids this problem.
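
A quick way to see that difference in PyTorch (training-mode BatchNorm, so batch statistics are used):

```python
import torch

torch.manual_seed(0)
sample = torch.randn(1, 8)
mates_a = torch.randn(3, 8)   # one set of batch-mates
mates_b = torch.randn(3, 8)   # a different set of batch-mates

bn = torch.nn.BatchNorm1d(8).train()  # normalizes across the batch dimension
ln = torch.nn.LayerNorm(8)            # normalizes each sample independently

bn_a = bn(torch.cat([sample, mates_a]))[0]
bn_b = bn(torch.cat([sample, mates_b]))[0]
ln_a = ln(torch.cat([sample, mates_a]))[0]
ln_b = ln(torch.cat([sample, mates_b]))[0]

print(torch.allclose(bn_a, bn_b))  # False: batch norm output depends on batch-mates
print(torch.allclose(ln_a, ln_b))  # True: layer norm does not
```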

Because these models are context-sensitive. Every token can influence the output.

In some mixture-of-experts approaches, samples or tokens are distributed among experts, and the experts are selected by predicting which expert is a good match for each sample. Depending on your neighbors in the batch, you might be assigned different experts.
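
A toy sketch of one common mechanism behind this, capacity-limited routing (the router, capacity, and dimensions here are made up for illustration, not any particular MoE implementation): when a token's preferred expert is already full of its batch-mates, it gets bumped to a lower-ranked expert.

```python
import torch

torch.manual_seed(0)
num_experts, d = 4, 16
router = torch.nn.Linear(d, num_experts)  # toy router (hypothetical)
capacity = 2                              # each expert takes at most 2 tokens per batch

def route(tokens):
    scores = router(tokens)                       # (n_tokens, num_experts)
    ranked = scores.argsort(-1, descending=True)  # each token's expert preferences
    load = [0] * num_experts
    assigned = []
    for prefs in ranked:                          # greedy, in batch order
        for e in prefs.tolist():
            if load[e] < capacity:
                load[e] += 1
                assigned.append(e)
                break
    return assigned

token = torch.randn(1, d)
lookalikes = token + 0.01 * torch.randn(5, d)     # batch-mates that prefer the same expert

alone = route(token)[0]
crowded = route(torch.cat([lookalikes, token]))[-1]

# Routed alone, the token gets its top expert; surrounded by batch-mates that
# fill that expert first, it typically gets bumped to a lower-ranked choice.
print(alone, crowded)
```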