
Comment by victorbjorklund

1 day ago

Notice that MoE isn’t different experts for different types of problems. Routing happens per token and isn’t really connected to the problem type.

So if you send some Python code, the first token in a function can be routed to one expert, the second to another, and so on.

Can you back this up with documentation? I don't believe that this is the case.

  • The router that routes tokens between the "experts" is itself trained as part of the model. MoE is really not a good name, as it makes people believe routing happens at a coarser level and that each expert is somehow trained on a different corpus. But what do I know; there are new architectures every week and someone might have done MoE differently.

  • Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they can all handle each token but some are better suited to do so.
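The per-token routing described above can be sketched in a few lines. This is a toy illustration with random weights, not any real model's implementation: a linear router scores every token against every expert, each token independently keeps its top-k experts, and their outputs are mixed by the (renormalised) router probabilities. All names and dimensions here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Toy weights: one linear router plus one linear layer per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens):
    """Route each token independently to its top-k experts."""
    probs = softmax(tokens @ router_w)              # (n_tokens, n_experts)
    topk = np.argsort(probs, axis=-1)[:, -top_k:]   # per-token expert choice
    out = np.zeros_like(tokens)
    for t, tok in enumerate(tokens):
        weights = probs[t, topk[t]]
        weights = weights / weights.sum()           # renormalise over chosen experts
        for w, e in zip(weights, topk[t]):
            out[t] += w * (tok @ expert_w[e])       # weighted mix of expert outputs
    return out, topk

tokens = rng.normal(size=(5, d_model))              # e.g. 5 tokens of "def foo():"
out, chosen = moe_layer(tokens)
print(chosen)  # adjacent tokens frequently land on different experts
```

Note that the routing decision is just another learned layer inside the forward pass, which is why deleting a rarely-chosen expert (as in the pruning approach mentioned above) mostly just shifts those tokens onto the remaining experts.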