Comment by lordswork
14 days ago
MoE as an idea specific to neural networks has been around since 1991 [1]. OP is probably aware, but adding for others following along: while MoE has roots in ensembling, there are some important differences. Traditional ensembles run every model on every input and combine their outputs, whereas MoE (at least in its modern sparse form) uses a gating mechanism to activate only a subset of experts per input. This enables efficient scaling via conditional computation and expert specialization, rather than redundancy.
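To make the difference concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption for illustration: the `Expert` class, the scalar "experts", the fixed `gate_weights`, and the top-k softmax routing are hypothetical stand-ins, not how any particular MoE model is implemented. The call counters are only there to show how many experts actually run for a given input.

```python
import math


class Expert:
    """Hypothetical toy expert: a scalar linear map that counts its calls."""

    def __init__(self, weight):
        self.weight = weight
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.weight * x


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(experts, x):
    # Traditional ensemble: every model runs on every input, outputs are averaged.
    outputs = [expert(x) for expert in experts]
    return sum(outputs) / len(outputs)


def moe_predict(experts, gate_weights, x, k=2):
    # Gate: score each expert for this input (assumed linear gate, one weight per expert).
    scores = [w * x for w in gate_weights]
    # Conditional computation: keep only the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize the gate over the selected experts only.
    probs = softmax([scores[i] for i in top])
    # Only the selected experts are ever called.
    return sum(p * experts[i](x) for p, i in zip(probs, top))


if __name__ == "__main__":
    weights = (1.0, 2.0, 3.0, 4.0)

    experts = [Expert(w) for w in weights]
    ensemble_predict(experts, 1.0)
    print("ensemble calls per expert:", [e.calls for e in experts])

    experts = [Expert(w) for w in weights]
    moe_predict(experts, [0.1, 0.9, -0.5, 0.3], 1.0, k=2)
    print("MoE calls per expert:     ", [e.calls for e in experts])
```

In the ensemble, the cost per input grows with the number of models. In the gated version it grows with `k`, so you can add experts (capacity) without adding per-input compute, which is the conditional-computation point above. In a real MoE the gate is learned jointly with the experts, which is what lets them specialize.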