Comment by perching_aix
2 months ago
For those looking to save time, the answer is batched inference. Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.
This is also why you may experience a variance in replies when using these services, even when you set the temperature to 0 and the seed to a fixed value. It's cause you don't control the other prompts yours get batched with. Could this be a data exfiltration attack vector? Probably, I didn't "research" that far.
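If it helps to see what "batched" means in practice, here's a minimal sketch of batched decoding with Hugging Face transformers — the model ("gpt2"), the prompts, and the generation settings are just placeholders, not what any provider actually runs:

```python
# Minimal sketch of batched inference: several users' prompts are padded
# into one tensor and pushed through a single generate call, instead of
# three sequential calls. Model and settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"            # decoder-only models pad on the left
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token of its own
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "User A asks: what is batching?",
    "User B asks: summarize this article.",
    "User C asks: write a haiku about GPUs.",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```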
> Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.
I naively assumed providers did that with all models. Or does it only work for this (family of?) model(s)?
It works for a lot of families, but not all. You need a high enough degree of sharing of model weights between different queries for that to make sense (memory access being the usual bottleneck nowadays, though smaller models see something similar with matmul batch efficiencies for CPU-related reasons).
Fully connected transformers trivially work (every weight for every query). MoE works beyond a certain size or with certain types of mixing (still using every weight, or using a high enough fraction that there's some sharing with batches of 20+ queries). As you push further in that direction though (lots of techniques, but the key point being accessing less of the model at once and bypassing some of it for each query), you need larger and larger batches for those efficiency gains to materialize. At some point it becomes untenable because of the latency of waiting for batches of data, and past that it becomes untenable because of the volume of query data.
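To make the memory-access argument concrete, here's a rough sketch (arbitrary matrix sizes, plain NumPy rather than anything a real serving stack uses) of how the per-query cost of a weight-dominated matmul drops as the batch grows, because one read of the weights is shared by every row in the batch:

```python
# Rough sketch: a single weight matrix is reused across every query in
# the batch, so per-query time falls until you become compute-bound.
import time
import numpy as np

d = 4096
W = np.random.randn(d, d).astype(np.float32)   # stand-in for "model weights"

def per_query_time(batch_size, iters=10):
    x = np.random.randn(batch_size, d).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x @ W          # one matmul serves all rows (queries) at once
    return (time.perf_counter() - t0) / (iters * batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  time per query ~ {per_query_time(b) * 1e6:.1f} us")
```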
Batching. Yes.
And one thing it can help with locally is when you rate certain content and want to make sure the model didn't hallucinate. So you toss the same prompt in 3 or 5 times, or… batch_size times. :)
Curious that batched inference has been there from day one, but it takes a while for people to see/grasp/grok it.
> other prompts yours get batched with
Why would batching lead to variance?
> Why would batching lead to variance?
Depending on the shape of the data, a slightly different kernel implementation (e.g. for matrix multiplication) will be picked as the fastest, and those implementations can give slightly different results. There can also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic because they use tricks to go faster).
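The usual root cause is that floating-point addition isn't associative, so a kernel that reduces in a different order (for instance because the batch shape changed) returns a slightly different number for the "same" math. A tiny NumPy illustration, sizes arbitrary:

```python
# Summing the same values in two different orders: the results typically
# differ in the last bits, which is exactly the kind of noise that
# shape-dependent kernel selection exposes.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s_forward = np.sum(x)                                   # one reduction order
s_chunked = sum(np.sum(c) for c in np.split(x, 1000))   # another order
print(s_forward, s_chunked, s_forward == s_chunked)     # usually not equal
```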
Yep, this. I see a lot of other worryingly confident answers in the thread that are wrong.
SGLang finally has at least some notes[0], but I’m always surprised there isn’t more of a community-wide effort to track down the sources of non-determinism.
[0] https://docs.sglang.ai/references/faq.html
> not entirely deterministic
There's a Nobel prize waiting for you if that's the case. I'll assume you meant theoretically consistent or accurate.
Some of the non-determinism mentioned above manifests as sensitivity to _where_ data falls within a batch.
Attention doesn't get batched, and the runtime of attention for a given user's token depends on the total context length. Hence even in the ideal scenario where you get a dedicated attention-calculating GPU, the MLP-calculating GPU doing the batching will have to wait for the slowest user.
In the worst-case scenario you are sharing a single attention-calculating GPU with someone who has a super long context window; that guy will be hogging most of the memory bandwidth of the GPU, even though you are both generating the same number of tokens.
This means that in the distributed setting, you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts.
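Back-of-envelope sketch of that bandwidth point, with made-up model dimensions (roughly dense-7B-ish, fp16 KV cache) — the thing to notice is how the per-generated-token read scales with the user's context length:

```python
# Per decoded token, the MLP cost is fixed, but attention has to read the
# whole KV cache, which grows linearly with context length. Dimensions
# below are illustrative, not any particular model.
n_layers, n_heads, d_head = 32, 32, 128
bytes_per_elem = 2  # fp16

def kv_cache_bytes(context_len):
    # keys + values, across all layers and heads, for the whole context
    return 2 * n_layers * n_heads * d_head * bytes_per_elem * context_len

for ctx in (1_000, 8_000, 128_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"context={ctx:>7}: ~{gb:.2f} GB of KV cache read per generated token")
```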
Batching can lead to variance with things like batchnorm, but most transformers use layer norm to avoid this problem.
Batchnorm can only have an effect between batches during training, not inference.
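Quick way to convince yourself of that, if you want: in eval mode BatchNorm normalizes with its stored running statistics, so an item's output doesn't depend on who else is in the batch. Toy PyTorch example, arbitrary sizes:

```python
# In eval mode, BatchNorm1d uses running_mean/running_var rather than
# batch statistics, so the same input gives the same output whether it
# is alone or batched with 31 strangers.
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(8)
bn.train()
_ = bn(torch.randn(64, 8))   # populate running stats during "training"
bn.eval()

x = torch.randn(1, 8)
alone = bn(x)
batched = bn(torch.cat([x, torch.randn(31, 8)]))[0:1]
print(torch.allclose(alone, batched))  # True: batch composition doesn't matter
```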
Because these models are context-sensitive. Every token can influence the output.
But not the tokens that don't even feed into your output because they're feeding into someone else's output. Separate items in batches don't get mixed up with each other - they just run the model separately on each item at the same time, like SIMD.
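For example (toy PyTorch linear layer, arbitrary sizes), stacking two inputs into one batch gives the same per-item result as running them separately, up to the kernel-level float noise discussed elsewhere in the thread:

```python
# Batch items are processed independently: batched forward == per-item
# forwards, row by row, within floating-point tolerance.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16).eval()

a, b = torch.randn(1, 16), torch.randn(1, 16)
with torch.no_grad():
    separate = torch.cat([layer(a), layer(b)])
    together = layer(torch.cat([a, b]))
print(torch.allclose(separate, together, atol=1e-6))  # True
```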
I believe they are talking about latency variance. Batching can increase variance because you may have to wait for enough prompts to get to the batch size.
In some mixture-of-experts approaches, samples or tokens are being distributed among experts. The experts are selected by trying to predict what is a good expert-sample match. Depending on your neighbors in the batch, you might be assigned different experts.
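A minimal sketch of that kind of capacity-limited routing (one flavour of MoE, not any particular provider's implementation; the scores, sizes, and capacity are made up): whether your token lands on its preferred expert depends on how strongly the other tokens in the batch score on it.

```python
# Each expert keeps only its `capacity` highest-scoring tokens from the
# batch, so a token's expert assignment depends on its batch neighbors.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, capacity = 8, 4, 2
scores = rng.random((n_tokens, n_experts))      # router affinity scores

assignment = {}
for e in range(n_experts):
    chosen = np.argsort(scores[:, e])[-capacity:]   # expert e's picks
    for t in chosen:
        assignment.setdefault(int(t), []).append(e)

print(assignment)  # change the other rows of the batch and this can change
```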
Sounds like an amazing attack vector if your prompts get mixed with others'.
What's the average batch size?
Wow, almost like Deepseek’s impressive performance is the result of optimisation by smart engineers.
Not sure why the snarky tone, didn't say or imply otherwise, nor did anyone else in the thread so far that I could see.
It wasn't meant to come across that snarky. Sorry about that. :/