Comment by slavboj

2 months ago

It is not "slow and expensive", though it could be one or the other. You can get 3 tokens/second running on DDR4 memory on a two-generation-old workstation that costs ~$1K, via llama.cpp.

You're most likely confusing the real DeepSeek with a distilled version, unless you have more than 192 GB of RAM.

  • Workstations routinely accommodate much more than that. The "under $1K" price referred to a 768 GB build (12× 64 GB sticks on a Skylake-based system); you could also do a dual-socket version with twice that, at the cost of dealing with NUMA (which could be a pro or a con for throughput, depending on how you're spreading bandwidth between nodes).
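
A back-of-envelope sketch of why ~3 tokens/second is plausible: CPU decoding is memory-bandwidth bound, so throughput is roughly usable bandwidth divided by bytes read per token. The specific figures below (6-channel DDR4 bandwidth, DeepSeek's active MoE parameter count, quantization width, efficiency factor) are illustrative assumptions, not measurements from the build described above:

```python
def tokens_per_second(bandwidth_gb_s, active_params_b, bits_per_weight, efficiency):
    """Rough upper bound on decode speed for a bandwidth-bound model.

    bandwidth_gb_s   -- peak memory bandwidth in GB/s (assumed figure)
    active_params_b  -- parameters touched per token, in billions (MoE models
                        only read the active experts, not all weights)
    bits_per_weight  -- quantized weight width (e.g. ~4.5 for llama.cpp Q4_K_M)
    efficiency       -- fraction of peak bandwidth actually achieved (assumed)
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return efficiency * bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: 6-channel DDR4-2666 on Skylake-SP ~ 128 GB/s peak,
# DeepSeek-R1 with ~37B active parameters, ~4.5-bit quant, ~50% efficiency.
rate = tokens_per_second(128, 37, 4.5, 0.5)
print(f"{rate:.1f} tokens/s")  # lands in the low single digits
```

With these assumptions the estimate comes out near 3 tokens/second, consistent with the claim; the same formula also shows why a dual-socket build only helps if NUMA placement actually spreads the weight reads across both nodes' memory controllers.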