Comment by sandGorgon

15 hours ago

DeepSeek kind of innovated on this using off-the-shelf components, right?

To quote from their paper: "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is codesigned with the MoE gating algorithm and the network topology of our cluster."