Comment by thesz

8 hours ago

https://en.wikipedia.org/wiki/Mixture_of_experts#Sparsely-ga...

"The sparsely-gated MoE layer,[21] published by researchers from Google Brain, uses feedforward networks as experts, and linear-softmax gating. Similar to the previously proposed hard MoE, they achieve sparsity by a weighted sum of only the top-k experts, instead of the weighted sum of all of them."

Note the "top-k experts": in the case of some DeepSeek models, k=1.
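
To make the quoted description concrete, here is a minimal Python sketch of linear-softmax gating with a weighted sum over only the top-k experts. Everything named here is hypothetical and for illustration only: `top_k_gate`, `moe_forward`, the toy experts, and the gating matrix do not come from the Wikipedia article or from any DeepSeek code. It also assumes the softmax is taken over just the k selected logits, which is one common formulation and not necessarily what any particular model does.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Keep the k largest logits and return {expert_index: weight}.

    Assumption: weights are a softmax over the kept logits only,
    so they always sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def moe_forward(x, gate_w, experts, k):
    """Sparse MoE layer: only the k selected experts are evaluated."""
    # Linear gating: one logit per expert (row of gate_w dotted with x).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)  # experts outside the top k are never called
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


# Toy stand-ins for the feedforward experts (hypothetical).
experts = [
    lambda x: [2 * v for v in x],
    lambda x: [v + 1 for v in x],
    lambda x: [-v for v in x],
]
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out, gates = moe_forward([1.0, 2.0], gate_w, experts, k=1)
print(gates, out)
```

Under that assumption, k=1 makes the single gate weight exactly 1, so the layer simply returns the output of the one chosen expert; with k=2 the output is a blend of two experts whose weights sum to 1.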
