
Comment by pornel

13 days ago

No, it's more like sharding of parameters. There's no interpretable distinction between the experts.

I understand they're only optimizing for load distribution, but have people been trying to disentangle what the various experts learn?

  • Mixture of experts involves trained router components that route each input to specific experts, but without any terms enforcing load distribution this tends to collapse during training, with most tokens getting routed to just one or two experts (a minimal sketch of such a router and balancing term follows this list).

  • Keep in mind that the "experts" are selected per layer, so there isn't even a single expert selection you can correlate with a token, but an interplay of abstract features across many experts at many layers (see the second sketch below).
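
For concreteness, here's a minimal sketch of what that looks like: a top-k router plus a Switch-Transformer-style load-balancing auxiliary loss. Everything here (the MoELayer name, the sizes, the aux_weight coefficient) is an illustrative assumption, not any particular model's code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d_model=64, n_experts=8, top_k=2, aux_weight=0.01):
            super().__init__()
            self.n_experts, self.top_k, self.aux_weight = n_experts, top_k, aux_weight
            # The trained router: a plain linear map from hidden state to expert logits.
            self.router = nn.Linear(d_model, n_experts)
            # Each "expert" is just an independent FFN shard of the layer's parameters.
            self.experts = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (n_tokens, d_model)
            probs = F.softmax(self.router(x), dim=-1)        # (n_tokens, n_experts)
            topk_p, topk_i = probs.topk(self.top_k, dim=-1)  # each token picks k experts
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e in range(self.n_experts):
                    mask = topk_i[:, k] == e
                    if mask.any():
                        out[mask] += topk_p[mask, k, None] * self.experts[e](x[mask])
            # Load-balancing term: fraction of tokens dispatched to each expert
            # times its mean router probability. Without something like this,
            # routing tends to collapse onto one or two experts.
            dispatch = F.one_hot(topk_i[:, 0], self.n_experts).float().mean(dim=0)
            importance = probs.mean(dim=0)
            aux_loss = self.aux_weight * self.n_experts * (dispatch * importance).sum()
            return out, aux_loss

During training the aux_loss is added to the main language-modeling loss. That's exactly the "term enforcing load distribution" above, and note it optimizes for balance, not for any interpretable specialization.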
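
And on the per-layer point, reusing the hypothetical MoELayer from the sketch above: each layer in a stack makes its own independent top-k choice, so a token's routing is a whole trace of per-layer decisions rather than one expert you could label.

    layers = nn.ModuleList(MoELayer() for _ in range(4))  # four MoE layers

    x = torch.randn(16, 64)  # 16 tokens, d_model=64
    total_aux = torch.tensor(0.0)
    for depth, layer in enumerate(layers):
        # A fresh routing decision at every depth: which experts does token 0 hit here?
        picks = layer.router(x).topk(layer.top_k, dim=-1).indices[0].tolist()
        print(f"layer {depth}: token 0 routed to experts {picks}")
        x, aux = layer(x)
        total_aux = total_aux + aux

Printing the picks shows different expert indices at different depths, which is why trying to correlate "expert 3" with some topic rarely yields anything meaningful.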