Comment by kannanvijayan

5 days ago

Interesting. So what's the strategy there? Just assume that each expert will learn some underlying clustering of semantic associations, but not direct it?

Not even that. The "experts" are not expert in any particular topic.

MoE is an architecture change meant to lower the compute spent per token, for both training and serving an LLM. You basically have many smaller sub-networks (unfortunately called experts) and a router on top of them, and only a few of the experts run for any given token. The router "learns" which experts to activate for each token, but that doesn't need to follow any semantic association. For the same math problem you could get experts 1 and 234 activated on the first token, 5 and 132 on the second token, and so on.
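
To make the per-token routing concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: the names `route` and `moe_layer`, the sizes, the random router weights, and the "experts" that just scale their input. In a real model the router is trained along with the rest of the network and there is typically one such router per MoE layer; this only shows the top-k selection mechanic.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a small list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # One score per expert: dot product of the token vector with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Normalize only over the selected experts so their gate weights sum to 1.
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token_vec, router_weights, experts, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token_vec)
    for idx, gate in route(token_vec, router_weights, k):
        y = experts[idx](token_vec)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy setup: 8 experts, 4-dimensional token vectors.
rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 8
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Stand-in experts: expert i just multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(NUM_EXPERTS)]

tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
for t, vec in enumerate(tokens):
    chosen = [i for i, _ in route(vec, router_weights)]
    print(f"token {t}: experts {chosen}")
```

The selection depends only on the token's vector and the router weights, so nothing ties an expert to a topic, and different tokens from the same prompt can land on different experts. Only `k` of the experts do any work per token, which is where the compute saving comes from.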