Comment by exe34
2 months ago
Variance based on actual randomness would be one thing, but to me variance based on what other people are running seems concerning, for reasons I can't quite articulate. I don't want the model to reply to a question in one domain based on what a large group of other people are thinking in a different domain (e.g. if they're discussing the news with chatgpt).
This definitely happens, and I'm surprised it's not talked about more often. Some attention kernels are more susceptible to this than others (I've found that paged attention is better than just naive attention, for example).
To be fair, I suppose people do it too - if you ask me a question about A, often as not the answer will be coloured by the fact that I just learnt about B.