Comment by calaphos
14 days ago
Mixture of experts involves a trained router component that routes each input to specific experts. Without any loss term enforcing load distribution, though, the router tends to collapse during training, so that most of the input ends up routed to just one or two experts.
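To make the "term enforcing load distribution" concrete, here is a rough sketch of one common choice: an auxiliary loss in the style of the one used in Switch Transformer, which multiplies the fraction of tokens sent to each expert by the mean router probability for that expert. The comment above does not say which term it has in mind, so the formula, the top-1 routing, and names like `load_balance_loss` and `router_logits` are my assumptions for illustration, not anything from a particular library.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits):
    """Hypothetical helper: router_logits is a list of per-token lists,
    one logit per expert. Assumes top-1 routing."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i across tokens.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    # Equals 1 when routing is uniform, and grows toward num_experts
    # as routing collapses onto a single expert.
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Toy data (made up): 8 tokens, 4 experts.
balanced = [[5.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
collapsed = [[5.0 if e == 0 else 0.0 for e in range(4)] for t in range(8)]

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

In training, a term like this would be scaled by a small coefficient and added to the main loss, so the router gets penalized when it piles everything onto one or two experts. The collapsed case scores noticeably higher than the balanced one here, which is the whole point of the term.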