Comment by corey_moncure

2 months ago

If I understand it correctly, the MoE output for each token is a weighted sum of the contributions of whichever experts that token is routed to, with the routing decided per token. Because that sum is order-independent, it should be possible to broadcast a large batch of tokens to multiple GPUs and stream the experts into VRAM, partitioned across the GPUs. Then the bottleneck is your PCI-E bandwidth. With 2 GPUs at Gen 4 x16, you should have roughly 60 GB/s of host-to-device (TX) bandwidth, letting you upload a ~4-bit quant of DeepSeek (about 360 GB) in about 6 seconds.

  1 GPU  -  30 GB/s TX - 12 seconds
  2 GPUs -  60 GB/s TX - 6 seconds
  4 GPUs - 120 GB/s TX - 3 seconds
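
(Quick back-of-the-envelope check of that column in Python, using the same ~360 GB and ~30 GB/s-per-GPU assumptions as above:)

  model_gb = 360  # same ~360 GB figure as above
  for gpus in (1, 2, 4):
      tx_gbps = 30 * gpus  # ~30 GB/s of Gen 4 x16 TX bandwidth per GPU
      print(f"{gpus} GPU(s): {tx_gbps} GB/s TX -> {model_gb / tx_gbps:.0f} seconds")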

Then you just tune your batch size so that each GPU's compute time matches its upload time. The per-expert partial results can be pulled back from the GPUs and summed, roughly as in the sketch below.
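
Here is a minimal sketch of that loop, assuming a PyTorch-style setup (single GPU shown; the one-matrix experts, the routes table and the gates weights are made-up stand-ins rather than DeepSeek's actual layout, and the separate CUDA stream / double-buffering you'd need to truly overlap transfer with compute is omitted):

  import torch

  d_model, n_experts, n_tokens, top_k = 1024, 64, 8192, 2
  device = "cuda" if torch.cuda.is_available() else "cpu"

  # Expert weights live in host RAM (pinned so PCIe uploads can be async)
  # and are streamed to the GPU one expert at a time.
  expert_weights = [
      torch.randn(d_model, d_model).pin_memory() if device == "cuda"
      else torch.randn(d_model, d_model)
      for _ in range(n_experts)
  ]

  tokens = torch.randn(n_tokens, d_model, device=device)                      # whole batch resident on GPU
  routes = torch.randint(0, n_experts, (n_tokens, top_k), device=device)      # which experts each token meets
  gates = torch.softmax(torch.randn(n_tokens, top_k, device=device), dim=-1)  # per-token mixing weights

  output = torch.zeros_like(tokens)

  for e in range(n_experts):
      # Upload this expert's weights over PCIe.
      w = expert_weights[e].to(device, non_blocking=True)

      # Find every (token, slot) pair routed to expert e.
      tok_idx, slot_idx = (routes == e).nonzero(as_tuple=True)
      if tok_idx.numel() == 0:
          continue

      # Accumulate this expert's weighted contribution; the order of experts
      # doesn't matter because the final result is just a sum.
      contrib = tokens[tok_idx] @ w
      output.index_add_(0, tok_idx, contrib * gates[tok_idx, slot_idx].unsqueeze(-1))

With multiple GPUs, each one would run the same loop over its own slice of the experts against a copy of the full batch, and the per-GPU output tensors get summed at the end.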