jnovek 1 day ago
I think that's only true for MoE models. A dense model like 3.6 27b will require more (plus a KV cache).
bityard 1 day ago
No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of the experts is used to predict the next token, but the set of experts used changes with every token, so all of the weights have to stay available.
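To put rough numbers on the memory-versus-compute point, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names `moe_footprint` and `experts_touched`, the toy parameter counts, and especially the uniformly random router (real routers are learned). It only shows that the resident weights scale with the total number of experts while per-token work scales with the number selected, and that the selected set shifts from token to token.

```python
import random


def moe_footprint(n_layers, n_experts, top_k, expert_params,
                  shared_params, bytes_per_param=2):
    """Return (resident_bytes, active_bytes_per_token) for a toy MoE model.

    All arguments are hypothetical sizes, not taken from any real model.
    """
    # Every expert in every layer must be loaded, since any may be selected.
    total_params = shared_params + n_layers * n_experts * expert_params
    # Only top_k experts per layer do work for a given token.
    active_params = shared_params + n_layers * top_k * expert_params
    return total_params * bytes_per_param, active_params * bytes_per_param


def experts_touched(n_experts, top_k, n_tokens, seed=0):
    """Simulate routing for one layer and return the set of experts used.

    Assumption: the router picks top_k experts uniformly at random per token.
    """
    rng = random.Random(seed)
    used = set()
    for _ in range(n_tokens):
        used.update(rng.sample(range(n_experts), top_k))
    return used


if __name__ == "__main__":
    resident, active = moe_footprint(
        n_layers=4, n_experts=8, top_k=2,
        expert_params=1_000_000, shared_params=2_000_000,
    )
    print(f"resident weights: {resident / 1e6:.0f} MB")
    print(f"weights read per token: {active / 1e6:.0f} MB")
    print("experts used after 1 token:", sorted(experts_touched(8, 2, 1)))
    print("experts used after 50 tokens:", sorted(experts_touched(8, 2, 50)))
```

The KV cache is left out on purpose; it comes on top of the weights for dense and MoE models alike.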