Comment by 7777777phil

6 days ago

Been spending a bunch of time lately trying to figure out why these ~120B MoE models keep beating much larger dense ones.

With Mistral it's 128 experts but only 4 of them active per token, so any given forward pass only touches something like 6B params. That's a very different kind of model from just scaling a dense transformer bigger. Also wrote a little post on where I think this is going: https://philippdubach.com/posts/the-last-architecture-design...
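
If it helps, here's a rough sketch of what top-k expert routing looks like in plain PyTorch. The hidden sizes, expert count, and FFN width below are made up for illustration (not the actual Mistral config), but it shows why the parameters touched per token are a thin slice of the total:

    # Rough sketch of top-k MoE routing; all sizes are illustrative, not a real config.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_model, d_ff = 1024, 4096        # hypothetical hidden / FFN widths
    n_experts, top_k = 128, 4         # 128 experts, 4 active per token

    router = nn.Linear(d_model, n_experts, bias=False)
    experts = nn.ModuleList(
        nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for _ in range(n_experts)
    )

    def moe_forward(x):                               # x: (tokens, d_model)
        weights, idx = router(x).topk(top_k, dim=-1)  # pick 4 of the 128 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(top_k):
            for e in idx[:, slot].unique().tolist():  # only the chosen experts ever run
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * experts[e](x[mask])
        return out

    y = moe_forward(torch.randn(8, d_model))

    per_expert = sum(p.numel() for p in experts[0].parameters())
    print(f"expert params total: {n_experts * per_expert / 1e6:.0f}M, "
          f"touched per token: {top_k * per_expert / 1e6:.0f}M")

Scale the dims up and you end up with exactly that kind of split: a huge total parameter count, but only a small fraction of it in any one forward pass.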