Comment by MoonGhost
2 months ago
> 16x 3090 system
That's about 5 kW of power.
> that gets 7 token/s in llama.cpp
Just looking at the electricity bill, it's cheaper to use the API of any major provider.
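A quick back-of-envelope check (the electricity price and API price below are my own assumptions, not figures from the thread):

```python
power_kw = 5.0            # rough draw of a 16x 3090 box under load
tokens_per_s = 7.0        # quoted llama.cpp throughput
price_per_kwh = 0.15      # assumed electricity price, USD
api_price_per_mtok = 1.0  # assumed hosted-API price per 1M tokens, USD

tokens_per_kwh = tokens_per_s * 3600 / power_kw          # ~5,040 tokens per kWh
local_cost_per_mtok = 1_000_000 / tokens_per_kwh * price_per_kwh
print(f"local: ${local_cost_per_mtok:.0f}/1M tokens vs API: ${api_price_per_mtok:.2f}/1M tokens")
# -> local: ~$30 per 1M tokens, well above typical API pricing
```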
> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
That's interesting; it means the model could be pruned, with tokens that would have gone to the removed experts routed to the closest remaining expert in case they ever come up.
Or merge the bottom 1/8 (or whatever fraction) of experts together and (optionally) do some minimal training with all other weights frozen. You'd need to modify the MoE routers slightly to map old -> new expert indices so you don't have to retrain them.
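A minimal sketch of that router remapping idea, assuming a PyTorch-style MoE whose gate returns (top-k weights, expert ids); `RemappedRouter` and the `remap` table are hypothetical names for illustration, not anything from DeepSeek's actual code:

```python
import torch
import torch.nn as nn

class RemappedRouter(nn.Module):
    """Wraps an existing (frozen) gate; redirects pruned expert ids to survivors."""
    def __init__(self, gate: nn.Module, remap: torch.Tensor):
        super().__init__()
        self.gate = gate                      # original router, weights untouched
        self.register_buffer("remap", remap)  # remap[old_expert_id] -> new expert index

    def forward(self, x):
        weights, old_ids = self.gate(x)       # assumed signature: (top-k weights, expert ids)
        return weights, self.remap[old_ids]   # tokens for merged experts go to their survivor

# Example: 8 experts, merge the two least-activated (5 -> 4, 7 -> 6),
# then re-index the 6 surviving experts as 0..5.
remap = torch.tensor([0, 1, 2, 3, 4, 4, 5, 5])
```

The merged experts' FFN weights would be folded into their surviving neighbor (e.g. averaged), and then the optional light fine-tuning with everything else frozen lets the survivors absorb the merged traffic.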