Comment by refibrillator

14 days ago

> the actual processing happens in 17B

This is a common misconception about how MoE models work. To be clear, 17B parameters are activated for each token generated.

In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of them.
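To make that concrete, here's a minimal sketch of why: even with one routed expert per token, the set of distinct experts touched grows quickly over a sequence. The shapes below (16 routed experts of ~6B params each plus ~11B of shared weights, landing near 17B active / ~107B total) and the uniform-random router are illustrative assumptions, not the real architecture:

```python
import random

# Hypothetical MoE shapes, chosen so the totals land near 17B active /
# ~107B total: 16 routed experts plus a shared trunk. These numbers are
# illustrative assumptions, not the real architecture.
NUM_EXPERTS = 16
EXPERTS_PER_TOKEN = 1
SHARED_PARAMS_B = 11.0      # billions of params every token touches
PARAMS_PER_EXPERT_B = 6.0   # billions of params per routed expert

touched = set()
for token in range(1, 65):
    # Real routers are learned; uniform-random is a stand-in here.
    touched.update(random.sample(range(NUM_EXPERTS), EXPERTS_PER_TOKEN))
    active_b = SHARED_PARAMS_B + EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT_B
    pulled_b = SHARED_PARAMS_B + len(touched) * PARAMS_PER_EXPERT_B
    if token in (1, 8, 64):
        print(f"token {token:2d}: {active_b:.0f}B active this token, "
              f"~{pulled_b:.0f}B distinct params pulled so far")
```

After a few dozen tokens nearly every expert has been hit at least once, so the whole weight set ends up streaming through the memory hierarchy even though each individual token only uses 17B.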

I agree the OP's description is wrong. That said, I think his conclusions are right: a quant of this that fits in 512GB of RAM is going to run about 8x faster than a quant of a dense model that fits in the same RAM, especially on Macs, since they are heavily memory-bandwidth bound.
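First-order arithmetic behind that, under the simplifying assumption that decode time scales with bytes of weights read per token and that both quants fill the same 512GB: the active fraction alone gives ~6.4x, so the ~8x figure presumably folds in other effects on top.

```python
# Same-RAM comparison, assuming decode time scales with bytes of weights
# read per token. Both models are assumed quantized to fill 512 GB of RAM;
# the sizes here are illustrative, not measurements.
RAM_GB = 512

dense_read_gb = RAM_GB            # dense: every weight, every token
moe_read_gb = RAM_GB * 17 / 109   # MoE: only the active fraction per token
print(f"dense: {dense_read_gb:.0f} GB/token, MoE: {moe_read_gb:.0f} GB/token, "
      f"ratio ~{dense_read_gb / moe_read_gb:.1f}x")
```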

For all intents and purposes, the cache may as well not exist when the working set is 17B or 109B parameters, so it's still better that fewer parameters are activated for each token. 17B parameters runs ~6x faster than 109B simply because less data needs to be loaded from RAM per token.
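Plugging in assumed numbers to check that ~6x: say ~800 GB/s of memory bandwidth (an assumption roughly in the ballpark of high-end Apple silicon unified memory) and a ~4-bit quant, with decode purely bandwidth bound:

```python
# Back-of-envelope decode speed, assuming generation is purely
# memory-bandwidth bound. 800 GB/s and 0.5 bytes/param (~4-bit quant)
# are assumed figures; adjust for your hardware and quant.
BANDWIDTH_GB_S = 800
BYTES_PER_PARAM = 0.5

for params_b in (17, 109):
    # 1B params at 0.5 bytes/param is 0.5 GB read per token
    gb_per_token = params_b * BYTES_PER_PARAM
    print(f"{params_b:>3}B params read/token -> {gb_per_token:5.1f} GB/token, "
          f"~{BANDWIDTH_GB_S / gb_per_token:5.1f} tok/s")
```

The ratio comes out to 109/17 ≈ 6.4 regardless of the assumed bandwidth; only the absolute tok/s figures depend on it.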

  • Yes, "loaded from RAM" versus "loaded into RAM" is the big distinction here.

    It will still be slow if portions of the model need to be read from disk into memory on each pass, but only having to execute a portion of the model for each token is a huge speed improvement; the rough numbers sketched below show the size of that gap.
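Rough numbers for that disk-versus-RAM gap, with assumed bandwidths (~800 GB/s for unified memory, ~7 GB/s for a fast NVMe SSD) and ~8.5 GB of active weights per token (17B params at a ~4-bit quant):

```python
# Time to move one token's active weights over assumed link speeds.
# 8.5 GB/token is ~17B params at ~0.5 bytes/param; both bandwidth
# figures are illustrative assumptions, not measurements.
ACTIVE_GB_PER_TOKEN = 8.5

for name, gb_s in (("unified memory (RAM)", 800), ("NVMe SSD", 7)):
    secs = ACTIVE_GB_PER_TOKEN / gb_s
    print(f"{name}: {secs:6.3f} s/token (~{1 / secs:5.1f} tok/s)")
```

That works out to roughly 94 tok/s versus under 1 tok/s, which is why keeping the whole model resident in RAM matters even when only a slice of it runs per token.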