Comment by itkovian_

17 hours ago

This is called linear mode connectivity, and it seems to hold for almost every large model. It works so well that in most cases it’s an explicit part of the training process: train many ‘branches’, merge them, then continue training.

It is not understood why it works so well.
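
For anyone who wants the mechanics spelled out, here is a minimal sketch in plain Python of what "merge" and "linearly connected" mean. All names (`interpolate`, `average_weights`, `loss_barrier`, `toy_loss`) are made up for illustration, and the loss is a toy quadratic, which is convex, so the barrier is trivially zero here. The surprising part with real networks is that their loss is not convex and the barrier is often still close to zero.

```python
def interpolate(w_a, w_b, alpha):
    """Point on the straight line between two weight vectors.

    alpha=0 returns w_a, alpha=1 returns w_b.
    """
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]


def average_weights(branches):
    """Uniform average of equally shaped weight vectors (the 'merge' step)."""
    n = len(branches)
    return [sum(ws) / n for ws in zip(*branches)]


def loss_barrier(loss_fn, w_a, w_b, steps=11):
    """Largest amount by which the loss along the straight line from w_a
    to w_b exceeds the linear interpolation of the two endpoint losses.

    A value near zero is one common way to say the two solutions are
    linearly mode connected.
    """
    loss_a, loss_b = loss_fn(w_a), loss_fn(w_b)
    worst = float("-inf")
    for i in range(steps):
        alpha = i / (steps - 1)
        on_path = loss_fn(interpolate(w_a, w_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, on_path - baseline)
    return worst


# Toy stand-in for a training loss (assumption: a simple quadratic bowl).
TARGET = [1.0, -2.0]


def toy_loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


# Two hypothetical 'branches' that ended up on opposite sides of the optimum.
branch_a = [0.0, -2.0]
branch_b = [2.0, -2.0]

merged = average_weights([branch_a, branch_b])
print("merged weights:", merged)
print("loss of merged:", toy_loss(merged))
print("barrier:", loss_barrier(toy_loss, branch_a, branch_b))
```

In a real setup the lists would be the flattened parameters of each branch's checkpoint, and `toy_loss` would be an evaluation pass over held-out data.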

Is that actually how they train them in the datacenter? The trillion-parameter weight vector gets cloned, sent off to groups of GPUs, and averaged afterward?