Comment by thesz
8 hours ago
https://en.wikipedia.org/wiki/Mixture_of_experts#Sparsely-ga...
"The sparsely-gated MoE layer,[21] published by researchers from Google Brain, uses feedforward networks as experts, and linear-softmax gating. Similar to the previously proposed hard MoE, they achieve sparsity by a weighted sum of only the top-k experts, instead of the weighted sum of all of them."
Note "top-k experts": in some of DeepSeek's models, k=1.
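To make the quoted top-k routing concrete, here is a minimal sketch in plain Python. It is not taken from the paper or from any DeepSeek code. The names top_k_gate, moe_forward, and gate_rows are made up for illustration, the experts are toy functions rather than feedforward networks, and applying the softmax only over the k kept logits is an assumption about how "linear-softmax gating" combines with top-k.

```python
import math

def top_k_gate(logits, k):
    """Keep the k largest logits and softmax over just those.

    Returns {expert_index: weight}; experts outside the top k get no entry.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, gate_rows, k):
    """Sparse MoE layer: weighted sum over only the top-k experts."""
    # Linear gating: one logit per expert (dot product of x with that expert's row).
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_rows]
    gates = top_k_gate(logits, k)
    # Only the selected experts are evaluated; this is where the sparsity comes from.
    outputs = {i: experts[i](x) for i in gates}
    # Assumes every expert returns a vector of the same length as x.
    return [sum(gates[i] * outputs[i][d] for i in gates) for d in range(len(x))]

# Toy stand-ins for expert networks (hypothetical, for illustration only).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gate_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [3.0, 1.0]

print(moe_forward(x, experts, gate_rows, k=1))  # a single expert, gate weight 1.0
print(moe_forward(x, experts, gate_rows, k=2))  # blend of the two best-scoring experts
```

With k=1 the softmax over one logit is always 1.0, so the layer just returns the output of whichever expert scored highest; the gate acts as a hard router rather than a mixer.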