Comment by lukeschlather
1 year ago
The thing is I'm not that interested in running something that will run on a $4K rig. I'm a little frustrated by articles like this, because they claim to be running "R1" but it's a quantized version and/or it has a small context window... it's not meaningfully R1. I think to actually run R1 properly you need more like $250k.
But it's hard to tell because most of the stuff posted is people trying to do duct tape and bailing wire solutions.
I can run the 671B-Q8 version of R1 with a big context on a used dual-socket Xeon I bought for about $2k with 768GB of RAM. It gets about 1-1.5 tokens/sec, which is fine to give it a prompt and just come back an hour or so later. To get to many 10s of tokens/sec, you would need >8 GPUs with 80GB of HBM each, and you're probably talking well north of $250k. For the price, the 'used workstation with a ton of DDR4' approach works amazingly well.
If you google, there is a $6k setup for the non-quantized version running like 3-4 tps.