But not the tokens that don't even feed into your output because they're feeding into someone else's output. Separate items in batches don't get mixed up with each other - they just run the model separately on each item at the same time, like SIMD.
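To make the "like SIMD" point concrete, here is a minimal sketch (numpy, toy dimensions, made-up random weights) of a batched self-attention forward pass: each batch item's output matches what you get running that item alone, because attention only mixes tokens within a sequence, never across the batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8      # toy model width
seq = 5    # toy tokens per sequence
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., seq, d). Attention mixes tokens *within* a sequence only;
    # any leading batch dimension is never mixed.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores) @ v

batch = rng.normal(size=(4, seq, d))        # 4 unrelated "users" in one batch
batched_out = self_attention(batch)          # one batched forward pass
solo_out = np.stack([self_attention(item) for item in batch])  # one at a time

# Per-item outputs match (up to float-level noise); other items don't leak in.
print(np.max(np.abs(batched_out - solo_out)))
```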
I believe they are talking about latency variance. Batching can increase latency variance because a request may have to wait for enough other prompts to arrive to fill the batch.
No, I meant that the responses will be different run-to-run. [0]
[0] https://152334h.github.io/blog/non-determinism-in-gpt-4/
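One low-level ingredient often cited in these discussions (the linked post goes into the GPT-4-specific details) is that floating-point addition is not associative: the same sum computed in a different order, e.g. under a different batch shape or kernel split, gives a slightly different result, which can flip a near-tied token choice even with greedy decoding. A tiny illustration of just that arithmetic fact, assuming nothing about any particular model:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000).astype(np.float32)

forward = np.sum(x)                    # one reduction order
shuffled = np.sum(rng.permutation(x))  # same values, different order
print(forward, shuffled, forward == shuffled)  # usually not exactly equal
```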
Variance based on actual randomness would be one thing, but to me variance based on what other people are running seems concerning, for reasons I can't quite articulate. I don't want the model to reply to a question in one domain based on what a large group of other people are thinking in a different domain (e.g. if they're discussing the news with chatgpt).