
Comment by pornel

13 days ago

No, it's more like sharding of parameters. There's no interpretable distinction between the experts.

I understand they're only optimizing for load distribution, but have people been trying to disentangle what the various experts learn?

  • Mixture of experts involves trained router components that route each input to specific experts, but without any terms enforcing load distribution this tends to collapse during training, with most tokens getting routed to just one or two experts (a minimal sketch of such a router and balancing term follows this list).

  • Keep in mind that the "experts" are selected per layer, so there isn't even a single expert selection you can correlate with a token, but an interplay of abstract features across many experts at many layers (see the second sketch below).
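
For concreteness, here's a minimal sketch of what that looks like: a top-k router plus a Switch-Transformer-style load-balancing auxiliary loss. Everything here (the MoELayer name, the sizes, the aux_weight coefficient) is an illustrative assumption, not any particular model's code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d_model=64, n_experts=8, top_k=2, aux_weight=0.01):
            super().__init__()
            self.n_experts, self.top_k, self.aux_weight = n_experts, top_k, aux_weight
            # The trained router: a plain linear map from hidden state to expert logits.
            self.router = nn.Linear(d_model, n_experts)
            # Each "expert" is just an independent FFN shard of the layer's parameters.
            self.experts = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (n_tokens, d_model)
            probs = F.softmax(self.router(x), dim=-1)        # (n_tokens, n_experts)
            topk_p, topk_i = probs.topk(self.top_k, dim=-1)  # each token picks k experts
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e in range(self.n_experts):
                    mask = topk_i[:, k] == e
                    if mask.any():
                        out[mask] += topk_p[mask, k, None] * self.experts[e](x[mask])
            # Load-balancing term: fraction of tokens dispatched to each expert
            # times its mean router probability. Without something like this,
            # routing tends to collapse onto one or two experts.
            dispatch = F.one_hot(topk_i[:, 0], self.n_experts).float().mean(dim=0)
            importance = probs.mean(dim=0)
            aux_loss = self.aux_weight * self.n_experts * (dispatch * importance).sum()
            return out, aux_loss

During training the aux_loss is added to the main language-modeling loss. That's exactly the "term enforcing load distribution" above, and note it optimizes for balance, not for any interpretable specialization.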
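
And on the per-layer point, reusing the hypothetical MoELayer from the sketch above: each layer in a stack makes its own independent top-k choice, so a token's routing is a whole trace of per-layer decisions rather than one expert you could label.

    layers = nn.ModuleList(MoELayer() for _ in range(4))  # four MoE layers

    x = torch.randn(16, 64)  # 16 tokens, d_model=64
    total_aux = torch.tensor(0.0)
    for depth, layer in enumerate(layers):
        # A fresh routing decision at every depth: which experts does token 0 hit here?
        picks = layer.router(x).topk(layer.top_k, dim=-1).indices[0].tolist()
        print(f"layer {depth}: token 0 routed to experts {picks}")
        x, aux = layer(x)
        total_aux = total_aux + aux

Printing the picks shows different expert indices at different depths, which is why trying to correlate "expert 3" with some topic rarely yields anything meaningful.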