Comment by itkovian_
17 hours ago
This is called linear mode connectivity and seems to work for almost every large model. So well that in most cases it’s an explicit part of the training process; do many training ‘branches’ then merge then continue.
It is not understood why it works so well.
is that actually how they train them in the datacenter? the trillion sized weight vector gets cloned and sent off to groups of GPUs and averaged after?