Comment by MoonGhost
2 months ago
> 16x 3090 system
That's about 5 kW of power.
> that gets 7 token/s in llama.cpp
Just looking at the electricity bill, it's cheaper to use the API of any major provider.
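A quick back-of-envelope check (the electricity price and API price below are my own assumptions, not figures from the thread):

```python
power_kw = 5.0            # rough draw of a 16x 3090 box under load
tokens_per_s = 7.0        # quoted llama.cpp throughput
price_per_kwh = 0.15      # assumed electricity price, USD
api_price_per_mtok = 1.0  # assumed hosted-API price per 1M tokens, USD

tokens_per_kwh = tokens_per_s * 3600 / power_kw          # ~5,040 tokens per kWh
local_cost_per_mtok = 1_000_000 / tokens_per_kwh * price_per_kwh
print(f"local: ${local_cost_per_mtok:.0f}/1M tokens vs API: ${api_price_per_mtok:.2f}/1M tokens")
# -> local: ~$30 per 1M tokens, well above typical API pricing
```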
> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
That's interesting; it means the model could be pruned, with tokens that would have gone to the removed experts routed to the closest remaining expert in case they ever come up.
Or merge the bottom 1/8 (or whatever fraction) of experts together and (optionally) do some minimal training with all other weights frozen. You'd need to modify the MoE routers slightly to map old -> new expert indices so you don't have to retrain them.
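A minimal sketch of that router remapping idea, assuming a PyTorch-style MoE whose gate returns (top-k weights, expert ids); `RemappedRouter` and the `remap` table are hypothetical names for illustration, not anything from DeepSeek's actual code:

```python
import torch
import torch.nn as nn

class RemappedRouter(nn.Module):
    """Wraps an existing (frozen) gate; redirects pruned expert ids to survivors."""
    def __init__(self, gate: nn.Module, remap: torch.Tensor):
        super().__init__()
        self.gate = gate                      # original router, weights untouched
        self.register_buffer("remap", remap)  # remap[old_expert_id] -> new expert index

    def forward(self, x):
        weights, old_ids = self.gate(x)       # assumed signature: (top-k weights, expert ids)
        return weights, self.remap[old_ids]   # tokens for merged experts go to their survivor

# Example: 8 experts, merge the two least-activated (5 -> 4, 7 -> 6),
# then re-index the 6 surviving experts as 0..5.
remap = torch.tensor([0, 1, 2, 3, 4, 4, 5, 5])
```

The merged experts' FFN weights would be folded into their surviving neighbor (e.g. averaged), and then the optional light fine-tuning with everything else frozen lets the survivors absorb the merged traffic.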