Comment by maven29

1 day ago

32B active parameters with a single shared expert.

This doesn’t change the VRAM usage, only the compute requirements.
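A rough back-of-envelope of that split, as I understand it (the parameter counts, byte sizes, and FLOP estimates below are illustrative assumptions, not official specs):

```python
# MoE: memory footprint scales with *total* weights, per-token compute with *active* weights.
total_params    = 1.0e12   # assumed total parameter count (~1T across all experts)
active_params   = 32e9     # parameters actually touched per token (32B active)
bytes_per_param = 1        # assume 8-bit quantized weights

footprint_gb   = total_params * bytes_per_param / 1e9   # what you have to store somewhere
gflops_per_tok = 2 * active_params / 1e9                # ~2 FLOPs per active weight

print(f"weights to hold : ~{footprint_gb:.0f} GB")
print(f"compute per tok : ~{gflops_per_tok:.0f} GFLOPs "
      f"(a dense model of the same total size would need ~{2 * total_params / 1e9:.0f})")
```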

  • It does not have to be VRAM; it could be system RAM, or weights streamed from SSD storage. Reportedly, the latter method achieves around 1 token per second on computers with 64 GB of system RAM.

    R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me to run than Llama 3 70B for exactly that reason: when a dense model spills out of the GPU, you take a large performance hit.

    If you need to spill into CPU inference, you really want to be multiplying a different 32B subset of the weights for each token rather than the same 70B (or more) every time, simply because the computation takes so long (rough numbers sketched below).
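    A quick sanity check of that claim, assuming CPU decode is memory-bandwidth bound, 8-bit weights, and a made-up ~50 GB/s of DDR4 bandwidth:

    ```python
    # Each generated token has to stream every weight it touches through the CPU once,
    # so decode speed is roughly bandwidth / bytes-touched-per-token (illustrative numbers).
    mem_bandwidth_gbs = 50.0   # assumed dual-channel DDR4-class bandwidth

    def tokens_per_sec(weights_touched_gb):
        return mem_bandwidth_gbs / weights_touched_gb

    print(f"dense 70B       : ~{tokens_per_sec(70):.1f} tok/s")
    print(f"MoE, 32B active : ~{tokens_per_sec(32):.1f} tok/s")
    ```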

    • The number of people who will be using it at 1 token/sec because there's no better option, and who have 64 GB of RAM, is vanishingly small.

      IMHO it sets the local LLM community back when we lean on extreme quantization and streaming weights from disk to claim something is possible*, because when people actually try it, it turns out to be an awful experience.

      * the implication being, anything is possible in that scenario


  • You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.

    For GPU inference at scale, I think token-level batching is used.
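    The intuition for batching, sketched with made-up numbers (generic batching arithmetic, not a description of any particular serving stack): once you are streaming the active weights anyway, every extra sequence in the batch reuses the same weight read, so throughput scales with batch size until you hit the compute ceiling.

    ```python
    # Throughput vs. batch size when decode is weight-streaming bound (illustrative only).
    mem_bandwidth_gbs = 50.0    # assumed memory bandwidth
    active_weights_gb = 32.0    # 32B active params at 8 bits each
    compute_gflops    = 2000.0  # assumed sustained matmul throughput
    gflops_per_token  = 2 * 32  # ~2 FLOPs per active weight (32B active)

    def tokens_per_sec(batch):
        bandwidth_limit = batch * mem_bandwidth_gbs / active_weights_gb
        compute_limit   = compute_gflops / gflops_per_token
        return min(bandwidth_limit, compute_limit)

    for b in (1, 4, 16, 64):
        print(f"batch {b:>2}: ~{tokens_per_sec(b):.1f} tok/s")
    ```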

    • Typically a combination of expert parallelism and tensor parallelism is used.

      The big MLP tensors would be split across GPUs in the cluster. Then for the MoE parts, you would spread the experts across the GPUs and route tokens to them based on which experts are active (there would likely be more than one if the batch size is > 1).
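      A toy, single-process sketch of that expert-parallel routing (numpy only; the expert count, sizes, round-robin placement, and router are all made up, and the all-to-all communication is reduced to a comment):

      ```python
      import numpy as np

      n_experts, n_gpus, top_k, d_model = 8, 4, 2, 16
      expert_to_gpu = {e: e % n_gpus for e in range(n_experts)}    # round-robin expert placement
      experts  = [np.random.randn(d_model, d_model) for _ in range(n_experts)]
      router_w = np.random.randn(d_model, n_experts)               # toy router weights

      def moe_layer(tokens):                    # tokens: (batch, d_model)
          out = np.zeros_like(tokens)
          for i, tok in enumerate(tokens):
              logits = tok @ router_w
              chosen = np.argsort(logits)[-top_k:]      # top-k active experts for this token
              for e in chosen:
                  gpu = expert_to_gpu[e]        # real cluster: all-to-all send of `tok` to `gpu`
                  out[i] += tok @ experts[e]    # expert runs on its home GPU
          return out / top_k

      print(moe_layer(np.random.randn(5, d_model)).shape)   # (5, 16)
      ```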