Comment by lordswork

14 days ago

MoE as an idea specific to neural networks has been around since 1991 [1]. OP is probably aware, but adding for others following along: while MoE has roots in ensembling, there are some important differences. Traditional ensembles run all models in parallel and combine their outputs, whereas MoE uses a gating mechanism to activate only a subset of experts per input. This enables efficient scaling via conditional computation and expert specialization rather than redundancy.
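To make the gating idea concrete, here is a minimal sketch of top-k routing for a single token, in plain NumPy. The names (`W_gate`, `experts`, `moe_forward`) and the linear "experts" are toy placeholders of my own, not the formulation from the 1991 paper or any particular library:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

# Toy "experts": each is just a weight matrix here
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
# Gating network: one linear layer producing a score per expert
W_gate = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    # x: (d_model,) hidden state for a single token
    scores = x @ W_gate                     # (n_experts,) gate scores
    top = np.argsort(scores)[-top_k:]       # indices of the top-k experts
    weights = softmax(scores[top])          # renormalize over the chosen experts
    # Only the selected experts run -- this is the conditional computation;
    # an ensemble would evaluate all n_experts and average them.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

Per token, only `top_k` of the `n_experts` matrices are touched, so parameter count can grow much faster than the compute per forward pass, which is the scaling argument above.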

[1]: https://ieeexplore.ieee.org/document/6797059