Comment by NitpickLawyer
5 hours ago
> With MoE models, you could fetch expert layers from the network on demand
This is a common misconception, probably due to the unfortunate naming. Expert layers aren't "expert" at any particular subject, and the "active" parameter count only refers to what gets activated per token. You'd still need all (or most of) the expert layers for any particular query, even if some of them have a very low chance of being activated.
All in all, you'd be better off lazy-loading the entire model; at least then you'd know you have the capability to run inference from that point on.
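To put some rough numbers on it, here's a toy simulation (made-up expert/layer counts and a random router, not any real model): with top-k routing per token and per layer, a single query touches a large share of the experts after only a modest number of tokens.

```python
# Toy sketch: simulate top-k MoE routing to estimate how much of the model a
# single query ends up touching. All constants are illustrative assumptions.
import random

NUM_EXPERTS = 64   # experts per MoE layer (assumed)
TOP_K = 8          # experts activated per token per layer (assumed)
NUM_LAYERS = 24    # MoE layers in the model (assumed)

def experts_touched(num_tokens: int) -> float:
    """Fraction of all expert weights routed to at least once for a query."""
    touched = set()
    for _ in range(num_tokens):
        for layer in range(NUM_LAYERS):
            # The router's per-token choice is effectively unpredictable from
            # the outside, so model it as a random top-k pick.
            for expert in random.sample(range(NUM_EXPERTS), TOP_K):
                touched.add((layer, expert))
    return len(touched) / (NUM_LAYERS * NUM_EXPERTS)

for n in (1, 8, 64, 512):
    print(f"{n:4d} tokens -> ~{experts_touched(n):.0%} of expert weights needed")
```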
Ultimately it would amount to lazy-loading the model, but with the parameters fetched from the network as they're needed, which still decreases time-to-first-token. It's true that the "expert" choices will span most of the model regardless of any particular "subject" or "topic", but if all we care about is time-to-first-token it's still a viable strategy.
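Something like this toy sketch is what I have in mind (the names and the fake "network" fetch are purely illustrative, not any real inference stack): expert weights are pulled the first time the router picks them, so emitting the first token only waits on the experts that token actually routes through.

```python
# Toy sketch of fetching expert weights on demand instead of downloading the
# whole checkpoint up front. Everything here is a stand-in / assumption.
import random
import time

NUM_EXPERTS = 64   # experts per MoE layer (assumed)
TOP_K = 8          # experts activated per token per layer (assumed)
NUM_LAYERS = 24    # MoE layers (assumed)

expert_cache: dict[tuple[int, int], list[float]] = {}

def fetch_from_network(layer: int, expert_id: int) -> list[float]:
    """Stand-in for e.g. a range request into a remote checkpoint."""
    time.sleep(0.001)        # pretend network latency
    return [0.0] * 1024      # dummy weights

def get_expert(layer: int, expert_id: int) -> list[float]:
    # Lazy load: only experts the router actually picks get downloaded,
    # so a first token can be produced before the full model is local.
    key = (layer, expert_id)
    if key not in expert_cache:
        expert_cache[key] = fetch_from_network(layer, expert_id)
    return expert_cache[key]

def moe_layer_forward(layer: int) -> None:
    routed = random.sample(range(NUM_EXPERTS), TOP_K)  # router's per-token pick
    for expert_id in routed:
        _weights = get_expert(layer, expert_id)  # may block on a download
        # ... run the expert and combine outputs (omitted) ...

# Generating the first token only touches TOP_K experts per layer, so only a
# fraction of the weights has to be fetched before it can be emitted.
for layer in range(NUM_LAYERS):
    moe_layer_forward(layer)
print(f"experts fetched for one token: {len(expert_cache)}"
      f" of {NUM_LAYERS * NUM_EXPERTS}")
```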
Perhaps you could generate a few tokens before the entire model is downloaded, but since every token takes a potentially different "path" through an MoE model, you'd still need to wait for the entire download before getting deeper than a handful of tokens... which is not really a UX improvement imo.