Comment by mechagodzilla
1 year ago
I can run the 671B-Q8 version of R1 with a big context on a used dual-socket Xeon I bought for about $2k with 768GB of RAM. It gets about 1-1.5 tokens/sec, which is fine to give it a prompt and just come back an hour or so later. To get to many 10s of tokens/sec, you would need >8 GPUs with 80GB of HBM each, and you're probably talking well north of $250k. For the price, the 'used workstation with a ton of DDR4' approach works amazingly well.
No comments yet
Contribute on Hacker News ↗