Comment by phire
14 days ago
BTW, I'd love to see a large model designed from scratch for efficient local inference on low-memory devices.
While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you from tuning them to switch experts only once or twice per token, and ideally to keep the same weights across multiple tokens.
Well, nothing stopping you, but there is the question of whether it will actually produce a worthwhile model.
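Something like this toy sketch is what I'm imagining, purely as an illustration (a made-up top-1 router with a "stickiness" bonus, not how any shipping MoE actually routes):

```python
# Toy sketch: a top-1 MoE router with a "stickiness" bonus so the previously
# used expert is preferred, keeping its weights resident across consecutive
# tokens on a low-memory device. Illustrative only; the stickiness knob and
# the whole class are hypothetical.
import torch
import torch.nn as nn

class StickyTop1Router(nn.Module):
    def __init__(self, d_model: int, n_experts: int, stickiness: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.stickiness = stickiness   # how strongly we prefer reusing the last expert
        self.prev_expert = None        # expert chosen for the previous token

    def forward(self, x: torch.Tensor) -> int:
        logits = self.gate(x)                            # x: (d_model,) for one token
        if self.prev_expert is not None:
            logits[self.prev_expert] += self.stickiness  # bias toward staying put
        expert = int(torch.argmax(logits))
        self.prev_expert = expert
        return expert

router = StickyTop1Router(d_model=512, n_experts=8, stickiness=2.0)
token = torch.randn(512)
print(router(token), router(token))  # second call is biased toward the first pick
```

The stickiness term is the knob that trades routing quality for fewer weight swaps.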
Intuitively it feels like there ought to be significant similarities between expert layers, because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" and then specialising the individual experts as low-rank adaptations on top of that base would let you save a lot of VRAM and expert-swapping. But it might mean you need to train with that structure from the start, rather than it being something you can distil down to.
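As a rough sketch of that structure (my own illustration, not an existing model): keep one dense base projection resident and express each expert as a LoRA-style low-rank delta on top of it, so switching experts only streams the small adapters:

```python
# Illustrative only: all experts share one dense base projection, and each
# expert is just a low-rank (LoRA-style) delta on top of it, so swapping
# experts moves far fewer bytes. Only the up-projection is shown, as a
# stand-in for the full FFN.
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base  # shared base weights, always resident in VRAM
        self.A = nn.Linear(base.in_features, rank, bias=False)   # small, per-expert
        self.B = nn.Linear(rank, base.out_features, bias=False)  # small, per-expert
        nn.init.zeros_(self.B.weight)  # start as a no-op delta, LoRA-style

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.B(self.A(x))  # shared base + per-expert delta

d_model, d_ff, n_experts = 1024, 4096, 8
shared_up = nn.Linear(d_model, d_ff)  # one copy, shared by every expert
experts = [LowRankExpert(shared_up, rank=16) for _ in range(n_experts)]
```

With those numbers each adapter is (1024 + 4096) × 16 ≈ 82K parameters versus ~4.2M for the full projection, so the per-expert data you'd have to swap is around 2% of a dense expert.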
Yes, DeepSeek introduced this optimisation: a common base "expert" that's always loaded. Llama 4 uses it too.
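Schematically the layout is something like this (shapes only; not DeepSeek's or Llama 4's actual code, and real models route to several experts per token rather than top-1):

```python
# Schematic of a shared-expert MoE layer: one expert is applied to every
# token unconditionally, and the router only chooses among the routed
# experts. Shapes and top-1 routing are simplifications for the sketch.
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_routed: int = 8):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = ffn()                                   # always loaded, runs on every token
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (tokens, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)         # (tokens, n_routed)
        top_w, top_i = weights.max(dim=-1)                    # top-1 routing
        out = self.shared(x)                                  # shared path for all tokens
        for i, expert in enumerate(self.routed):
            mask = top_i == i
            if mask.any():
                out[mask] = out[mask] + top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out
```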
I had a sneaking suspicion that I wouldn't be the first to think of it.
DeepSeek introduced a novel expert-training technique that increases expert specialization. For a given domain, their implementation tends to activate the same experts across different tokens, which is kind of what you're asking for!
I think Gemma 3 is marketed for single-GPU setups: https://blog.google/technology/developers/gemma-3/