Comment by easygenes

1 year ago

This is neat, but what I really want to see is someone running it on 8x 3090/4090/5090 and what is the most practical configuration for that.

According to NVIDIA:

> a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.

You can rent a single H200 for $3/hour.

I have been searching for a single example of someone running it like this (or 8x P40 and the like), and found nothing.

8x 3090 will net you around 10-12 tok/s.

  • It would not be that slow, as it is an MoE model with only 37B activated parameters.

    Still, 8x 3090 only gives you ~2.25 bits per weight (192 GB of VRAM for 671B parameters), which is not a healthy quantization. Bifurcating PCIe slots to get up to 16x 3090 would be necessary for lightning-fast inference with 4-bit quants; see the quick sketch at the end of the thread for the math.

    At that point, though, it becomes very hard to build a system due to PCIe lanes, signal integrity, the volume of space required, the heat generated, and the power requirements.

    This is the advantage of moving up to Quadro cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to be 96GB).

  • Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?
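
A quick back-of-the-envelope sketch of the bits-per-weight math referenced above. This is my own illustration, not from the thread: it assumes weights are the only thing in VRAM, while in practice the KV cache, activations, and CUDA overhead shave the real numbers down a bit.

```python
# Ceiling on bits per weight when packing DeepSeek-R1's 671B parameters
# into a given pool of GPU VRAM. Ignores KV cache / activation memory,
# so real-world figures come out somewhat lower.

PARAMS = 671e9  # DeepSeek-R1 total parameter count

def bits_per_weight(num_gpus: int, vram_gb: float) -> float:
    total_bits = num_gpus * vram_gb * 1e9 * 8
    return total_bits / PARAMS

for label, gpus, vram in [("8x 3090", 8, 24),
                          ("16x 3090", 16, 24),
                          ("8x 96GB Quadro", 8, 96)]:
    print(f"{label}: ~{bits_per_weight(gpus, vram):.2f} bits/weight ceiling")
```

That works out to ~2.29 bits/weight for 8x 3090 (hence the ~2.25 figure once cache overhead is accounted for), ~4.58 for 16x 3090 (enough headroom for a 4-bit quant), and ~9.16 for eight 96GB cards.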