
Comment by Hendrikto

2 months ago

Because these models are context-sensitive. Every token can influence the output.

But not tokens that never feed into your output because they belong to someone else's request. Separate items in a batch don't get mixed up with each other; the model just runs on each item independently at the same time, like SIMD. A toy sketch of that is below.
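A minimal numpy sketch of that independence (a stand-in, not any particular inference engine): each row of a batched matmul depends only on its own row, so stacking requests into a batch can't leak one prompt's tokens into another's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "model": a single weight matrix standing in for one layer.
W = rng.standard_normal((8, 4))

# Two separate "requests", each a vector of 8 features.
a = rng.standard_normal(8)
b = rng.standard_normal(8)

# Batched run: stack both requests along the batch dimension.
batch = np.stack([a, b])   # shape (2, 8)
batched_out = batch @ W    # shape (2, 4)

# Individual runs of the same "model".
out_a = a @ W
out_b = b @ W

# Each batch row matches the solo run: items never influence each other.
assert np.allclose(batched_out[0], out_a)
assert np.allclose(batched_out[1], out_b)
print("batch outputs match running each request alone")
```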

I believe they are talking about latency variance. Batching can increase variance because a request may have to sit and wait until enough other prompts arrive to fill the batch.
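A rough simulation of that effect (all numbers and names are made up for illustration): requests arrive at random times, but nothing is dispatched until batch_size of them have accumulated, so the earliest request in each batch waits the longest and per-request latency spreads out as the batch grows.

```python
import random
import statistics

def simulate_waits(batch_size, n_requests=10_000, mean_gap=0.01, seed=0):
    """Toy model: requests arrive one by one; a batch is dispatched only
    once batch_size requests have accumulated. Returns each request's
    time spent waiting for its batch to fill (compute time ignored)."""
    random.seed(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_requests):
        t += random.expovariate(1 / mean_gap)   # random inter-arrival gap
        arrivals.append(t)

    waits = []
    for i in range(0, n_requests - batch_size + 1, batch_size):
        group = arrivals[i:i + batch_size]
        dispatch = group[-1]                    # batch leaves once it's full
        waits.extend(dispatch - a for a in group)
    return waits

for bs in (1, 4, 16):
    w = simulate_waits(bs)
    print(f"batch_size={bs:2d}  mean wait={statistics.mean(w):.4f}s  "
          f"stdev={statistics.stdev(w):.4f}s")
```

With batch_size=1 every request leaves immediately (zero wait, zero variance); larger batches push both the mean wait and its spread up, which is the variance the parent comment is describing. Real servers usually cap this with a dispatch timeout rather than waiting indefinitely for a full batch.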