The router that routes tokens between the "experts" is itself trained as part of the model. The name MoE is really not a great one, since it makes people believe the split happens at a coarser level and that each expert is somehow trained on a different corpus, etc. But what do I know; there are new archs every week and someone might have done an MoE differently.
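To make that concrete, here's a rough sketch of what one of these layers can look like (PyTorch-style; the name MoELayer and all dimensions are made up for illustration). The router is literally just a small linear layer trained by backprop together with the experts it routes to:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        # One MoE feed-forward block with a learned top-k router.
        def __init__(self, d_model=512, n_experts=8, k=2):
            super().__init__()
            # The router is just a linear layer; it is trained end-to-end
            # with the experts via ordinary backprop.
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.k = k

        def forward(self, x):                      # x: (n_tokens, d_model)
            logits = self.router(x)                # (n_tokens, n_experts)
            weights, idx = logits.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # mixing weights for chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

Only k experts actually run per token, which is where the compute savings come from. Real implementations also add a load-balancing loss so the router doesn't collapse onto a couple of favorite experts; that's omitted here.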
Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since every expert can handle any token, but some are just better suited to it.
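For anyone curious what "delete the lesser-used experts" looks like mechanically, here's a toy version building on the MoELayer sketch above. REAP's actual saliency criterion is more sophisticated than raw usage counts; this only illustrates the shape of the operation:

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def prune_least_used_experts(layer, calib_tokens, n_keep=6):
        # Count how often the router picks each expert on a calibration batch.
        _, idx = layer.router(calib_tokens).topk(layer.k, dim=-1)
        counts = torch.bincount(idx.flatten(), minlength=len(layer.experts))
        keep = counts.topk(n_keep).indices.sort().values
        # Drop the rarely-chosen experts...
        layer.experts = nn.ModuleList(layer.experts[int(i)] for i in keep)
        # ...and shrink the router's output to match the survivors.
        new_router = nn.Linear(layer.router.in_features, n_keep)
        new_router.weight.copy_(layer.router.weight[keep])
        new_router.bias.copy_(layer.router.bias[keep])
        layer.router = new_router
        return counts

    # e.g. prune_least_used_experts(layer, torch.randn(1024, 512), n_keep=6)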
It's not only per token; each layer also has its own router and can choose different experts. https://huggingface.co/blog/moe#what-is-a-mixture-of-experts...
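A quick way to see that per-layer independence, reusing the MoELayer sketch from above (toy sizes, illustrative only): stack a few layers and print which experts the same token gets routed to at each depth.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    # Each layer owns an independent router, so the same token can be
    # sent to different experts at different depths.
    layers = nn.ModuleList(MoELayer() for _ in range(4))

    x = torch.randn(10, 512)                       # 10 tokens
    for depth, layer in enumerate(layers):
        picked = layer.router(x).topk(layer.k, dim=-1).indices
        print(f"layer {depth}: token 0 -> experts {picked[0].tolist()}")
        x = layer(x)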