Comment by easygenes

1 year ago

This is neat, but what I really want to see is someone running it on 8x 3090/4090/5090 and what is the most practical configuration for that.

According to NVIDIA:

> a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.

You can rent a single H200 for $3/hour.

I have been searching for a single example of someone running it like this (or 8x P40 and the like), and found nothing.

8x 3090 will net you around 10-12 tok/s.

  • It would not be that slow, as it is an MoE model with only 37B activated parameters.

    Still, 8x 3090 only gives you ~2.25 bits per weight (192 GB of VRAM for 671B parameters), which is not a healthy quantization. Bifurcating PCIe slots to get up to 16x 3090 would be necessary for lightning-fast inference with 4-bit quants; see the quick sketch at the end of the thread for the math.

    At that point, though, it becomes very hard to build a system due to PCIe lanes, signal integrity, the volume of space required, the heat generated, and the power requirements.

    This is the advantage of moving up to Quadro cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to be 96GB).

  • Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?
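
A quick back-of-the-envelope sketch of the bits-per-weight math referenced above. This is my own illustration, not from the thread: it assumes weights are the only thing in VRAM, while in practice the KV cache, activations, and CUDA overhead shave the real numbers down a bit.

```python
# Ceiling on bits per weight when packing DeepSeek-R1's 671B parameters
# into a given pool of GPU VRAM. Ignores KV cache / activation memory,
# so real-world figures come out somewhat lower.

PARAMS = 671e9  # DeepSeek-R1 total parameter count

def bits_per_weight(num_gpus: int, vram_gb: float) -> float:
    total_bits = num_gpus * vram_gb * 1e9 * 8
    return total_bits / PARAMS

for label, gpus, vram in [("8x 3090", 8, 24),
                          ("16x 3090", 16, 24),
                          ("8x 96GB Quadro", 8, 96)]:
    print(f"{label}: ~{bits_per_weight(gpus, vram):.2f} bits/weight ceiling")
```

That works out to ~2.29 bits/weight for 8x 3090 (hence the ~2.25 figure once cache overhead is accounted for), ~4.58 for 16x 3090 (enough headroom for a 4-bit quant), and ~9.16 for eight 96GB cards.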