Comment by NitpickLawyer
6 days ago
> Mixture-of-Experts seems like an attempt to do this - the domain structure being extracted into specific sub-models that are presumably trained on particular domain-associated content
This is a common misconception. MoE LLMs are NOT trained with each expert receiving domain-associated data. It's just an unfortunate naming decision that stuck, and it is commonly misunderstood by non-practitioners.
Interesting. So what's the strategy there? Just assume that each expert will learn some underlying clustering of semantic associations, but not direct it?
Not even that. The "experts" are not expert in any particular topic.
MoE is an architecture change meant to lower the compute per token for both training and serving an LLM, since only a few experts run for any given token. You basically have many smaller sub-networks (unfortunately called experts) and a router on top of them. The router "learns" which experts to activate for each token, but that doesn't need to follow any semantic association. For the same math problem you could get experts 1 and 234 activated on the first token, 5 and 132 on the second token, and so on.
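
To make the routing idea concrete, here is a toy sketch of top-k routing in plain Python. Everything in it is an assumption for illustration: the `ToyMoELayer` class, the sizes, and the random untrained weights are hypothetical, and each "expert" is just a small linear map. In real models the router is trained jointly with the experts, and the experts are typically feed-forward blocks inside each transformer layer. The point is only that expert selection is a per-token score lookup, with nothing tying an expert to a topic.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    """Random Gaussian weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical toy MoE layer: a linear router plus linear 'experts'."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # One row of router weights per expert: score = row . token
        self.router = rand_matrix(rng, num_experts, dim)
        self.experts = [rand_matrix(rng, dim, dim) for _ in range(num_experts)]

    def route(self, token):
        """Return [(expert_index, weight), ...] for the top-k scoring experts."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Softmax over the selected experts only, so their weights sum to 1.
        peak = max(logits[i] for i in chosen)
        exps = [math.exp(logits[i] - peak) for i in chosen]
        total = sum(exps)
        return [(i, e / total) for i, e in zip(chosen, exps)]

    def forward(self, token):
        """Run only the selected experts and mix their outputs."""
        routing = self.route(token)
        out = [0.0] * len(token)
        for idx, weight in routing:
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * v for o, v in zip(out, expert_out)]
        return out, [idx for idx, _ in routing]


# Route a short "sequence" of random token vectors through the layer.
layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
token_rng = random.Random(1)
tokens = [[token_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
for position, token in enumerate(tokens):
    _, used = layer.forward(token)
    print(f"token {position}: experts {used}")
```

Only `top_k` of the 8 experts do any work per token, which is where the compute saving comes from, and the selected pair depends on the individual token vector rather than on what the overall prompt is about.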