Comment by bick_nyers

2 months ago

There's still a lot of opportunity for software optimization here. The trouble is that really only two classes of systems get optimizations for Deepseek: 1 small GPU + a lot of RAM (ktransformers), and the system that has all the VRAM in the world.

A system with, say, 192GB of VRAM and the rest in standard system memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek @ 4-bit quite quickly because of the power-law-like usage of the experts.

If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

This would be an easier job for pruning, but I still think enthusiast systems are going to trend in a direction over the next couple of years that makes these kinds of software optimizations useful on a much larger scale.
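If you wanted to check that skew yourself before pruning anything, a minimal profiling sketch might look like this (PyTorch-style, with illustrative names and an assumed DeepSeek-V3-like expert count; nothing here is a real Deepseek, ktransformers, or llama.cpp API):

```python
import torch

# Illustrative only: histogram how often each routed expert is actually
# selected while running representative (e.g. English-only) prompts.
NUM_EXPERTS = 256   # assumption: DeepSeek-V3-style routed-expert count per MoE layer
expert_hits = torch.zeros(NUM_EXPERTS, dtype=torch.long)

def record_routing(topk_indices: torch.Tensor) -> None:
    # topk_indices: (tokens, top_k) expert ids chosen by one MoE layer's router
    expert_hits.add_(torch.bincount(topk_indices.flatten(), minlength=NUM_EXPERTS))

# After the profiling run, the rarely-hit experts are the pruning/merging candidates:
# cold_experts = (expert_hits < 0.1 * expert_hits.float().mean()).nonzero()
```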

There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be running at full bandwidth during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so something else is limiting performance.
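Rough numbers behind that claim, assuming the usual bandwidth-bound view of single-stream decoding (the ~936 GB/s figure is the 3090 spec; the ~18.5 GB active read per token is only an assumption for a ~37B-active-parameter MoE at 4-bit):

```python
# Rough roofline for bandwidth-bound decoding; all numbers are approximate.
bw_per_gpu_gb_s = 936        # RTX 3090 memory bandwidth
vram_per_gpu_gb = 24

scans_per_second = bw_per_gpu_gb_s / vram_per_gpu_gb
print(f"full-VRAM scans per second per 3090: {scans_per_second:.0f}")   # ~39

# Assumption: ~37B active params per token at ~4 bits -> ~18.5 GB read per token.
active_read_gb = 18.5
print(f"bandwidth-bound ceiling: {bw_per_gpu_gb_s / active_read_gb:.0f} tok/s")  # ~50
```

A ceiling around 50 tok/s versus 7 tok/s observed is what suggests the bottleneck isn't raw VRAM bandwidth.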

> 16x 3090 system

That's about 5 kW of power.

> that gets 7 token/s in llama.cpp

Just looking at the electricity bill, it's cheaper to use the API of any major provider.
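Back-of-the-envelope using the figures quoted above (the electricity price is an assumed placeholder, adjust for your own rate):

```python
# Energy cost per million output tokens at the quoted figures.
power_kw = 5.0            # ~16x 3090 under load (approximate)
tokens_per_s = 7.0
price_per_kwh = 0.15      # assumption: adjust to your local electricity rate

kwh_per_million_tokens = power_kw * (1_000_000 / tokens_per_s) / 3600
print(f"{kwh_per_million_tokens:.0f} kWh per 1M tokens")                 # ~198 kWh
print(f"~${kwh_per_million_tokens * price_per_kwh:.0f} per 1M tokens")   # ~$30
```

That's roughly an order of magnitude above typical hosted Deepseek API pricing per million output tokens, before even counting the hardware.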

> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

That's interesting. It means the model could be cut down, with those tokens routed to the nearest remaining expert in the rare case they do come up.

  • Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
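    A minimal sketch of that index remap (PyTorch-style; the expert counts, top-k, and the old -> new map here are placeholders, not Deepseek's real configuration):

    ```python
    import torch

    # Remap old -> new expert indices after merging experts, so the trained
    # router weights are reused unchanged. Names and sizes are illustrative.
    num_old_experts = 256            # assumption: original routed-expert count
    num_new_experts = 224            # e.g. after merging the bottom 1/8
    old_to_new = torch.randint(0, num_new_experts, (num_old_experts,))  # placeholder map

    def route(router_logits: torch.Tensor, top_k: int = 8):
        # Standard top-k gating over the ORIGINAL expert dimension...
        gate_vals, old_idx = torch.topk(router_logits, top_k, dim=-1)
        gate_weights = torch.softmax(gate_vals, dim=-1)
        # ...then translate each selected index into the merged expert table.
        new_idx = old_to_new[old_idx]
        return gate_weights, new_idx
    ```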

A single MI300X has 192GB of VRAM.

  • The sad reality is that the MI300X isn't a monolithic die, so the chiplets have internal bandwidth limitations (of course less severe than going over PCIe/NVLink).

    In AMD's own parlance, the "Modular Chiplet Platform" presents itself either in single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in multiple-I-actually-totally-do-care-about-speed-and-latency-NUMA-like "Core Partitioned X-celerator" mode.

    So you kinda still need to care about what loads where.

    • I have never heard of a GPU where a deep understanding of how memory is managed was not critical to getting the best performance.