Comment by perching_aix
2 months ago
For those looking to save time, the answer is batched inference. Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.
This is also why you may experience a variance in replies when using these services, even when you set the temperature to 0 and the seed to a fixed value. It's cause you don't control the other prompts yours get batched with. Could this be a data exfiltration attack vector? Probably, I didn't "research" that far.
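If it helps to see what "batched" means in practice, here's a minimal sketch of batched decoding with Hugging Face transformers — the model ("gpt2"), the prompts, and the generation settings are just placeholders, not what any provider actually runs:

```python
# Minimal sketch of batched inference: several users' prompts are padded
# into one tensor and pushed through a single generate call, instead of
# three sequential calls. Model and settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"            # decoder-only models pad on the left
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token of its own
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "User A asks: what is batching?",
    "User B asks: summarize this article.",
    "User C asks: write a haiku about GPUs.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```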
> Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.
I naively assumed providers did that with all models. Or does it only work for this (family of?) model(s)?
It works for a lot of families, but not all. You need a high enough degree of sharing of model weights between different queries for that to make sense (memory access being the usual bottleneck nowadays, though smaller models see something similar with matmul batch efficiencies for CPU-related reasons).
Fully connected transformers trivially work (every weight for every query). MoE works beyond a certain size or with certain types of mixing (still using every weight, or using a high enough fraction that there's some sharing with batches of 20+ queries). As you push further in that direction though (lots of techniques, but the key point being accessing less of the model at once and bypassing some of it for each query), you need larger and larger batches for those efficiency gains to materialize. At some point it becomes untenable because of the latency of waiting for batches of data, and past that it becomes untenable because of the volume of query data.
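To make the memory-access argument concrete, here's a rough sketch (arbitrary matrix sizes, plain NumPy rather than anything a real serving stack uses) of how the per-query cost of a weight-dominated matmul drops as the batch grows, because one read of the weights is shared by every row in the batch:

```python
# Rough sketch: a single weight matrix is reused across every query in
# the batch, so per-query time falls until you become compute-bound.
import time
import numpy as np

d = 4096
W = np.random.randn(d, d).astype(np.float32)   # stand-in for "model weights"

def per_query_time(batch_size, iters=10):
    x = np.random.randn(batch_size, d).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x @ W          # one matmul serves all rows (queries) at once
    return (time.perf_counter() - t0) / (iters * batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  time per query ~ {per_query_time(b) * 1e6:.1f} us")
```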
Batching. Yes.
And one thing it can help with locally is when you rate certain content and want to make sure the model didn't hallucinate. So you toss the same prompt in 3 or 5 times, or… batch_size times. :)
Curious that batched inference has been there from day one, but it takes a while for people to see/grasp/grok it.
> other prompts yours get batched with
Why would batching lead to variance?
> Why would batching lead to variance?
Depending on the shape of the data, a slightly different kernel implementation (e.g. for matrix multiplication) will be picked as the fastest, and those implementations can give slightly different results. There can also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic because they use tricks to go faster).
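The usual root cause is that floating-point addition isn't associative, so a kernel that reduces in a different order (for instance because the batch shape changed) returns a slightly different number for the "same" math. A tiny NumPy illustration, sizes arbitrary:

```python
# Summing the same values in two different orders: the results typically
# differ in the last bits, which is exactly the kind of noise that
# shape-dependent kernel selection exposes.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s_forward = np.sum(x)                                   # one reduction order
s_chunked = sum(np.sum(c) for c in np.split(x, 1000))   # another order
print(s_forward, s_chunked, s_forward == s_chunked)     # usually not equal
```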
Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.
SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community-wide effort to track down the sources of non-determinism.
[0] https://docs.sglang.ai/references/faq.html
> not entirely deterministic
There's a Nobel prize waiting for you if that's the case. I'll assume you meant theoretically consistent or accurate.
Some of the non-determinism mentioned above manifests as sensitivity to _where_ data falls within a batch.
Attention doesn't get batched, and the runtime of attention for a given user's token depends on the total context length. Hence even in the ideal scenario where you get a dedicated attention-calculating GPU, the MLP-calculating GPU doing the batching will have to wait for the slowest user.
In the worst-case scenario you are sharing a single attention-calculating GPU with someone who has a super long context window; that guy will be hogging most of the memory bandwidth of the GPU, even though you are both generating the same number of tokens.
This means that in the distributed setting, you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts.
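Back-of-envelope sketch of that bandwidth point, with made-up model dimensions (roughly dense-7B-ish, fp16 KV cache) — the thing to notice is how the per-generated-token read scales with the user's context length:

```python
# Per decoded token, the MLP cost is fixed, but attention has to read the
# whole KV cache, which grows linearly with context length. Dimensions
# below are illustrative, not any particular model.
n_layers, n_heads, d_head = 32, 32, 128
bytes_per_elem = 2  # fp16

def kv_cache_bytes(context_len):
    # keys + values, across all layers and heads, for the whole context
    return 2 * n_layers * n_heads * d_head * bytes_per_elem * context_len

for ctx in (1_000, 8_000, 128_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"context={ctx:>7}: ~{gb:.2f} GB of KV cache read per generated token")
```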
Batching can lead to variance with things like batchnorm, but most transformers use layer norm to avoid this problem.
Batchnorm can only have an effect between batches during training, not inference.
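Quick way to convince yourself of that, if you want: in eval mode BatchNorm normalizes with its stored running statistics, so an item's output doesn't depend on who else is in the batch. Toy PyTorch example, arbitrary sizes:

```python
# In eval mode, BatchNorm1d uses running_mean/running_var rather than
# batch statistics, so the same input gives the same output whether it
# is alone or batched with 31 strangers.
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(8)
bn.train()
_ = bn(torch.randn(64, 8))   # populate running stats during "training"
bn.eval()

x = torch.randn(1, 8)
alone = bn(x)
batched = bn(torch.cat([x, torch.randn(31, 8)]))[0:1]
print(torch.allclose(alone, batched))  # True: batch composition doesn't matter
```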
Because these models are context-sensitive. Every token can influence the output.
But not the tokens that don't even feed into your output because they're feeding into someone else's output. Separate items in batches don't get mixed up with each other - they just run the model separately on each item at the same time, like SIMD.
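For example (toy PyTorch linear layer, arbitrary sizes), stacking two inputs into one batch gives the same per-item result as running them separately, up to the kernel-level float noise discussed elsewhere in the thread:

```python
# Batch items are processed independently: batched forward == per-item
# forwards, row by row, within floating-point tolerance.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16).eval()

a, b = torch.randn(1, 16), torch.randn(1, 16)
with torch.no_grad():
    separate = torch.cat([layer(a), layer(b)])
    together = layer(torch.cat([a, b]))
print(torch.allclose(separate, together, atol=1e-6))  # True
```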
I believe they are talking about latency variance. Batching can increase variance because you may have to wait for enough prompts to get to the batch size.
In some mixture-of-experts approaches, samples or tokens are being distributed among experts. The experts are selected by trying to predict what is a good expert-sample match. Depending on your neighbors in the batch, you might be assigned different experts.
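A minimal sketch of that kind of capacity-limited routing (one flavour of MoE, not any particular provider's implementation; the scores, sizes, and capacity are made up): whether your token lands on its preferred expert depends on how strongly the other tokens in the batch score on it.

```python
# Each expert keeps only its `capacity` highest-scoring tokens from the
# batch, so a token's expert assignment depends on its batch neighbors.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, capacity = 8, 4, 2
scores = rng.random((n_tokens, n_experts))      # router affinity scores

assignment = {}
for e in range(n_experts):
    chosen = np.argsort(scores[:, e])[-capacity:]   # expert e's picks
    for t in chosen:
        assignment.setdefault(int(t), []).append(e)

print(assignment)  # change the other rows of the batch and this can change
```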
Sounds like an amazing attack vector if your prompts get mixed with others'.
What's the average batch size?
Wow, almost like Deepseek’s impressive performance is the result of optimisation by smart engineers.
Not sure why the snarky tone, didn't say or imply otherwise, nor did anyone else in the thread so far that I could see.
It wasn't meant to come across that snarky. Sorry about that. :/