Comment by MoonGhost
13 days ago
> Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking
Makes sense to compare apples with apples. Same compute amount, right? Or are you giving the MoE model less time and then feeling like it underperforms? That shouldn't be surprising...
> These experts are say 1/10 to 1/100 of your model size if it were a dense model
Just to be correct: each layer (attention + fully connected) has its own router and experts, and there are usually 30+ layers. It can't be 1/10 per expert, since there are literally hundreds of them across the model.
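To make that parameter-count point concrete, here is a rough back-of-the-envelope sketch. All sizes (hidden dimension, expert count, top-k, layer count) are illustrative assumptions, not taken from any particular model:

```python
# Rough parameter accounting for a hypothetical MoE transformer.
# All numbers below are illustrative assumptions, not from a real model.

n_layers = 32          # "usually 30+ layers"
d_model = 4096         # hidden size (assumed)
n_experts = 64         # experts per MoE layer (assumed)
top_k = 2              # experts activated per token (assumed)

# Per layer: attention is shared/dense; the FFN is replaced by n_experts expert FFNs.
attn_params_per_layer = 4 * d_model * d_model        # Q, K, V, O projections
expert_params = 2 * d_model * (4 * d_model)          # one FFN expert (up + down proj)
moe_params_per_layer = n_experts * expert_params
router_params_per_layer = d_model * n_experts        # tiny linear router per layer

total = n_layers * (attn_params_per_layer + moe_params_per_layer + router_params_per_layer)
one_expert_fraction = expert_params / total

print(f"total params:     {total / 1e9:.1f}B")
print(f"one expert:       {expert_params / 1e6:.1f}M "
      f"(~1/{1 / one_expert_fraction:.0f} of the full model)")

# Active params per token: attention + router + only top_k experts in each layer.
active = n_layers * (attn_params_per_layer + router_params_per_layer + top_k * expert_params)
print(f"active per token: {active / 1e9:.1f}B ({active / total:.1%} of total)")
```

With these assumed numbers a single expert is on the order of 1/2000 of the full parameter count, far below 1/10; the 1/10 to 1/100 range makes more sense as the *active* fraction per token (attention plus the top-k selected experts per layer), not the size of one expert.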