Comment by martin_

1 day ago

How do you run a 1T-parameter model at low cost?

12 comments

maven29 1 day ago

32B active parameters with a single shared expert.

JustFinishedBSG 1 day ago
This doesn’t change the VRAM usage, only the compute requirements.
- selfhoster11 1 day ago
  
  It does not have to be VRAM; it could be system RAM, or the weights could be streamed from SSD storage. Reportedly, the latter method achieves around 1 token per second on computers with 64 GB of system RAM.
  R1 (and K2) are MoE models, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason: once a model spills out of GPU memory, you take a large performance hit.
  If you need to spill into CPU inference, you really want to be multiplying a different set of 32B weights for every token rather than the same 70B (or more) weights every time, simply because the computation takes so long.
  
- maven29 1 day ago
  
  You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.
  For GPU inference at scale, I think token-level batching is used.
  
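A back-of-the-envelope sketch of the trade-off discussed in this thread: total parameters decide how much memory (VRAM, system RAM, or SSD) the weights occupy, while active parameters decide how many bytes must be read per generated token. The parameter counts (1T total, 32B active, 70B dense) come from the comments above. The quantization widths and the 200 GB/s memory bandwidth are assumptions chosen for illustration, and the estimate ignores the KV cache, activations, and compute time, so treat the results as rough upper bounds rather than benchmarks.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at the given precision."""
    return params * bits_per_weight / 8


def max_tokens_per_second(active_params: float, bits_per_weight: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the bottleneck:
    every generated token has to read all active weights once."""
    return bandwidth_bytes_per_s / weight_bytes(active_params, bits_per_weight)


# Figures taken from the thread.
MOE_TOTAL = 1e12    # 1T total parameters
MOE_ACTIVE = 32e9   # 32B active parameters per token
DENSE = 70e9        # dense 70B model: all weights are active

# Assumption for illustration only: 200 GB/s of usable memory bandwidth.
BANDWIDTH = 200e9

for bits in (16, 8, 4):  # assumed quantization widths
    storage_gb = weight_bytes(MOE_TOTAL, bits) / 1e9
    moe_tps = max_tokens_per_second(MOE_ACTIVE, bits, BANDWIDTH)
    dense_tps = max_tokens_per_second(DENSE, bits, BANDWIDTH)
    print(f"{bits:>2}-bit: MoE weights {storage_gb:,.0f} GB, "
          f"MoE <= {moe_tps:.1f} tok/s, dense 70B <= {dense_tps:.1f} tok/s")
```

Under these assumptions the MoE model needs far more storage than the dense 70B model, but reads fewer bytes per token, which is the sense in which MoE helps once inference spills out of the GPU.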