Comment by skiing_crawling
1 day ago
I got an RTX 6000 Pro too. I like running locally. I've learned a lot more than I would have with an API, and there's less worry about overspending on tokens. I accidentally spent $100 on the Claude API in like 2 days because I didn't know what I was doing.
The problem is that while one of these GPUs is a huge improvement over a laptop or a single 3090, you very quickly wish you had more. I would buy a second one, but I did the math and realized that with the current crop of models, 2 Blackwells don't buy me any new capability that I didn't have with one. So I would need a 3rd one. And when I buy a 3rd one I will feel like I want to run a higher quant, so then I will want a 4th.
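For anyone who wants to redo that kind of math, here's a rough sketch of a weights-only estimate. It assumes 96 GB per card (which matches the 384 GB figure for 4 GPUs mentioned further down), and the 0.8 headroom factor for KV cache and runtime overhead is a guess, not a measured number. The function names and the 400B example model are hypothetical.

```python
import math

CARD_VRAM_GB = 96  # assumption: 96 GB per card (384 GB / 4 GPUs)


def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and framework overhead.
    """
    return params_billion * bits_per_weight / 8


def cards_needed(params_billion, bits_per_weight, headroom=0.8):
    """Cards needed if only `headroom` of each card's VRAM is used for weights.

    headroom=0.8 is an illustrative guess; tune it for your context length.
    """
    usable_gb = CARD_VRAM_GB * headroom
    return math.ceil(weight_gb(params_billion, bits_per_weight) / usable_gb)


if __name__ == "__main__":
    # 27B is the model size discussed in this thread; 400B is a made-up example.
    for params, bits in [(27, 8), (27, 16), (400, 4), (400, 8)]:
        print(f"{params}B @ {bits}-bit: ~{weight_gb(params, bits):.0f} GB, "
              f"{cards_needed(params, bits)} card(s)")
```

By this estimate a 27B model fits on one card even at 16-bit, while a hypothetical 400B model needs 3 cards at 4-bit and more at 8-bit, which is the "2 cards don't unlock anything new" pattern. Real requirements depend heavily on context length and the inference engine.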
You can fit DeepSeek 4 Flash on two with TP 2 and 6 concurrent streams at 65k context. 150 tok/s.
A pair of RTX6000 cards will give you a good performance boost due to tensor parallelism, though. I haven't tried the newest predictive quants but I see about 35 tps when running the 8-bit Qwen 3.6 27B model on one board and about 50 tps on two. Probably could come close to 100 tps on an optimized setup with the latest GGUFs.
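To put those two numbers in perspective, here's a quick sketch that turns them into a speedup and a per-GPU scaling efficiency. The only inputs are the 35 and 50 tps figures quoted above; the function name is made up for illustration.

```python
def tp_scaling(tps_single, tps_multi, n_gpus):
    """Return (speedup, efficiency) for a multi-GPU run vs. a single GPU.

    speedup    = multi-GPU throughput / single-GPU throughput
    efficiency = speedup / number of GPUs (1.0 would be perfect scaling)
    """
    speedup = tps_multi / tps_single
    efficiency = speedup / n_gpus
    return speedup, efficiency


if __name__ == "__main__":
    speedup, efficiency = tp_scaling(35, 50, 2)
    print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

With 35 tps on one board and 50 tps on two, that works out to roughly a 1.43x speedup, or about 71% of ideal scaling.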
Also, the 4-bit quants of MiniMax 2.7 will run at 100 tps or so with two cards, which is pretty decent. It doesn't go any faster at all with 4 GPUs from what I've seen, so if you don't actively need 384 GB of VRAM, 2x RTX6000 is a good place to be.
You can get 70-80 tps on qwen3.6-27b f16 with MTP on a single card
What kind of machine did you build around it?
I'm using an Epyc platform to get plenty of PCIe lanes and memory channels. I have a couple of extra 3090s plugged in, which take some of the offload and help with larger models that don't fit entirely on the Blackwell.