Comment by xscott

2 days ago

Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.

Just a theory.

Notice that MoE isn't different experts for different types of problems. It's per token, and not really connected to problem type.

So if you send Python code, the first token in a function can go to one expert, the second to another expert, and so on.
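To make the per-token claim concrete, here's a minimal sketch of top-k token routing as used in MoE layers. All names, sizes, and weights are hypothetical illustrations, not any specific model's code: a learned router scores each token against every expert, and each token is sent to its top-k experts independently of what the rest of the prompt is about.

```python
# Hypothetical per-token MoE routing sketch (illustrative, not a real model).
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # assumption: small illustrative expert pool
D_MODEL = 16      # assumption: tiny hidden dimension
TOP_K = 2         # each token is routed to its 2 highest-scoring experts

# Learned parameters (random here): router weights + one FFN matrix per expert.
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))
expert_w = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_MODEL))

def moe_layer(tokens):
    """tokens: (seq_len, d_model). Routing is decided per token."""
    logits = tokens @ router_w                      # (seq_len, num_experts)
    chosen = np.argsort(-logits, axis=-1)[:, :TOP_K]  # top-k expert per token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        # Softmax over only the chosen experts' scores to get gate weights.
        scores = logits[t, chosen[t]]
        gates = np.exp(scores - scores.max())
        gates /= gates.sum()
        # Weighted sum of the chosen experts' outputs for this one token.
        for gate, e in zip(gates, chosen[t]):
            out[t] += gate * (token @ expert_w[e])
    return out, chosen

tokens = rng.standard_normal((5, D_MODEL))
out, chosen = moe_layer(tokens)
print(chosen)  # adjacent tokens can land on completely different experts
```

The point of the sketch: `chosen` is a per-token decision made by the router, so two neighboring tokens of the same Python function can go to entirely different experts.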

  • Can you back this up with documentation? I don't believe that this is the case.

    • The router that routes the tokens between the "experts" is trained jointly with the rest of the model. The name MoE is really not a good one, as it makes people believe routing happens at a coarser level and that each expert is somehow trained on a different corpus. But what do I know; there are new architectures every week and someone might have done a MoE differently.

    • Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they all can handle each token but some are better positioned to do so.