Comment by zozbot234

3 days ago

Should be active param size, not model size.

Yes, you’re right.

Llama 3.1, however, is not MoE, so all params are active.

For MoE it is tricky: for each token only a subset of the params (the selected “experts”) is used, but you don’t know which subset in advance, so you either keep all experts in memory or wait for them to load from slower storage, potentially a different subset for each token.
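To make the active-vs-total distinction concrete, here’s a toy sketch with made-up numbers (a Mixtral-style top-2 router over 8 experts; the per-expert size is hypothetical, not any real model’s):

```python
import random

# Hypothetical MoE layer: 8 experts, router picks 2 per token.
n_experts = 8
top_k = 2
params_per_expert = 7e9 / 8  # made-up per-expert param count

total_params = n_experts * params_per_expert   # what you must hold somewhere
active_params = top_k * params_per_expert      # what one token actually uses

print(f"total:  {total_params / 1e9:.2f}B params")
print(f"active: {active_params / 1e9:.2f}B params")

# The catch: the router decides per token which experts fire, so you can't
# pre-select which expert weights to keep resident in fast memory.
router_choice = random.sample(range(n_experts), top_k)
print("experts used for this token:", sorted(router_choice))
```

This is why memory requirements track total params while compute per token tracks active params.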