Comment by MoonGhost

2 months ago

> 16x 3090 system

That's about 5 kW of power.

> that gets 7 token/s in llama.cpp

Just looking at the electricity bill, it's cheaper to use the API of any major provider. A rough back-of-envelope sketch below (electricity and API prices are placeholder assumptions):
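```python
# Back-of-envelope cost per million generated tokens (all numbers are assumptions):
# - 16x RTX 3090 at ~350 W each under load -> ~5.6 kW for the GPUs alone
# - 7 tokens/s generation speed (the quoted llama.cpp figure)
# - $0.15 per kWh electricity (placeholder; varies widely by region)
# - $1.10 per million output tokens for a hosted DeepSeek-class API (placeholder)

power_kw = 16 * 0.350            # GPU draw only, ignores CPU/PSU overhead
tokens_per_s = 7
price_per_kwh = 0.15             # assumed electricity rate
api_price_per_mtok = 1.10        # assumed API price per million tokens

seconds_per_mtok = 1_000_000 / tokens_per_s
kwh_per_mtok = power_kw * seconds_per_mtok / 3600
electricity_per_mtok = kwh_per_mtok * price_per_kwh

print(f"kWh per 1M tokens: {kwh_per_mtok:.0f}")                       # ~222 kWh
print(f"Electricity cost per 1M tokens: ${electricity_per_mtok:.2f}")  # ~$33
print(f"API cost per 1M tokens:         ${api_price_per_mtok:.2f}")
```

Even at generous electricity rates, the power cost alone comes out an order of magnitude above typical API pricing at 7 tokens/s.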

> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

That's interesting. It means the model could be pruned, with tokens that would have gone to a removed expert routed to the closest remaining expert, just in case they do occur. A minimal sketch of that redirect below.
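This isn't DeepSeek's actual code, just an assumed setup: a router scores tokens against per-expert rows of a weight matrix, and we build a lookup that sends each dropped expert to its most similar surviving one so the router can be reused without retraining.

```python
import torch

# router_weight: [num_experts, hidden] rows used to score tokens per expert (assumed shape).
# keep: boolean mask of experts that survive pruning.
def build_redirect_table(router_weight: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    num_experts = router_weight.size(0)
    w = torch.nn.functional.normalize(router_weight, dim=-1)
    sim = w @ w.T                                      # [E, E] cosine similarity between experts
    sim[:, ~keep] = float("-inf")                      # only allow surviving experts as targets
    redirect = sim.argmax(dim=-1)                      # nearest surviving expert for each expert
    redirect[keep] = torch.arange(num_experts)[keep]   # kept experts map to themselves
    return redirect

# Example: 8 experts, drop the rarely-activated experts 5 and 7.
E, H = 8, 16
router_weight = torch.randn(E, H)
keep = torch.ones(E, dtype=torch.bool)
keep[torch.tensor([5, 7])] = False
redirect = build_redirect_table(router_weight, keep)
print(redirect)  # dropped slots now point at their nearest surviving expert
```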

Or merge the bottom 1/8 (or whatever fraction) of experts together and (optionally) do some minimal training with all other weights frozen. You'd need to modify the MoE routers slightly to map old -> new expert indices so the routers don't need retraining. Something like the sketch below.
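A hedged sketch of the merge idea, with made-up names and shapes: the least-used group of experts (by some profiled activation count) gets averaged into a single expert, and an old -> new index table is applied after the router's top-k so the router itself stays untouched.

```python
import torch

# expert_weights: list of per-expert FFN parameter dicts; usage: activation counts
# from profiling (both assumed). Uniform averaging is used here; usage-weighted
# averaging would be another option.
def merge_bottom_experts(expert_weights, usage, merge_fraction=0.125):
    E = len(expert_weights)
    n_merge = max(2, int(E * merge_fraction))
    order = sorted(range(E), key=lambda i: usage[i])
    merge_ids, keep_ids = order[:n_merge], sorted(order[n_merge:])

    # Average the parameters of the least-used group into one merged expert.
    merged = {
        name: torch.stack([expert_weights[i][name] for i in merge_ids]).mean(0)
        for name in expert_weights[merge_ids[0]]
    }

    new_experts = [expert_weights[i] for i in keep_ids] + [merged]
    merged_slot = len(new_experts) - 1

    # Old expert index -> new expert index, applied after the router's top-k.
    remap = torch.full((E,), merged_slot, dtype=torch.long)
    for new_idx, old_idx in enumerate(keep_ids):
        remap[old_idx] = new_idx
    return new_experts, remap

# Example: 8 tiny experts; the two least-used (ids 1 and 3) get merged.
E, H = 8, 4
experts = [{"w1": torch.randn(H, H), "w2": torch.randn(H, H)} for _ in range(E)]
usage = [100, 5, 80, 3, 60, 90, 70, 50]
new_experts, remap = merge_bottom_experts(experts, usage)
print(len(new_experts), remap)   # 7 experts; old ids 1 and 3 both map to the merged slot
```

The optional finetune would then update only the merged expert (and maybe the router) with everything else frozen.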