Comment by MoonGhost

13 days ago

> Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking

It makes sense to compare apples with apples: same compute budget, right? Or are you giving the MoE model less compute and then feeling like it underperforms? That shouldn't be surprising...

> These experts are say 1/10 to 1/100 of your model size if it were a dense model

Just to be accurate, each layer (attention + fully connected) has its own router and experts, and there are usually 30+ layers. An expert can't be 1/10 of the model when there are literally hundreds of them across the layers. A rough sketch of the arithmetic is below.
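
A minimal PyTorch sketch of what I mean, with made-up sizes (1024 hidden dim, 64 experts per layer, 32 layers; attention parameters left out, which would only shrink the fraction further). It just shows a per-layer router plus expert bank and then prints how small one expert is relative to the whole stack:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    # One transformer layer's MoE feed-forward block: its own router + its own expert bank.
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # per-layer router
        self.experts = nn.ModuleList(                  # per-layer experts
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, weighted by the router.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

layer = MoELayer()
expert_params = sum(p.numel() for p in layer.experts[0].parameters())
layer_params = sum(p.numel() for p in layer.parameters())
n_layers = 32  # "usually 30+ layers"
print(f"one expert / all MoE params ≈ {expert_params / (layer_params * n_layers):.4%}")
```

With these (hypothetical) numbers a single expert comes out around 0.05% of the MoE parameters, i.e. closer to 1/2000 of the model than 1/10.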