
Comment by furyofantares

1 day ago

You're right that I was confused about that.

LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36. That does slow it down, but I'm not finding it unusable, and it seems to have some advantages - though I doubt I'll keep running it this way.

FWIW, that's an 80GB model and you also need the KV cache. You'd need 96GB-ish to run it all on the GPU.
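
Back-of-the-envelope, as a sketch (the KV-cache numbers below are assumptions - actual size depends on context length and the model's attention config):

    # Rough memory estimate for running everything on the GPU.
    # All config values are assumptions for illustration.
    model_gb = 80                  # quantized weight file size
    n_layers = 36
    kv_heads, head_dim = 8, 128    # assumed GQA layout
    ctx_len = 131072               # assumed full context window
    # K + V per layer, fp16: 2 tensors * ctx * heads * head_dim * 2 bytes
    kv_gb = 2 * ctx_len * kv_heads * head_dim * 2 * n_layers / 1e9
    print(f"~{model_gb + kv_gb:.0f} GB total")   # ~99 GB at full context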

  • Do you know if it's doing what was described earlier, when I run it with all layers on GPU - paging an expert in every time the expert changes? Each expert is only 5.1B parameters.

    • ^ Er, misspoke - each expert is at most 0.9B parameters, and there are 128 experts. 5.1B is the number of active parameters (4 experts + some other parameters).
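
      The arithmetic, as a quick sketch (the shared-parameter count is just derived from the numbers above, not taken from the model card):

          n_experts, active_experts = 128, 4
          per_expert = 0.9e9                             # stated upper bound per expert
          shared = 5.1e9 - active_experts * per_expert   # ~1.5B non-expert params
          print(f"all experts ~{n_experts * per_expert / 1e9:.0f}B, "
                f"shared ~{shared / 1e9:.1f}B")          # ~115B experts, ~1.5B shared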

    • It makes absolutely no sense to do what OP described. The decode stage is bottlenecked on memory bandwidth. Once you pull the weights from system RAM, your work is almost done. To then ship gigabytes of weights PER TOKEN over PCIe to do some trivial computation on the GPU is crazy.

      What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster (rough numbers sketched below).

      I'm guessing LM Studio gracefully falls back to running _something_ on the CPU. Hopefully you're running only the MoE layers on the CPU. I've only ever used llama.cpp.
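
      Rough numbers behind that, as a sketch (all bandwidth figures are assumptions; your hardware will differ):

          # Decode is bandwidth-bound: tokens/s ~ bandwidth / bytes touched per token.
          bytes_per_param = 0.7          # ~80GB file over ~115B params
          active_bytes = 5.1e9 * bytes_per_param
          vram_bw   = 1000e9             # high-end GPU VRAM, bytes/s
          sysram_bw = 80e9               # dual-channel DDR5
          pcie_bw   = 32e9               # PCIe 4.0 x16
          for name, bw in [("all on GPU", vram_bw),
                           ("MoE on CPU", sysram_bw),
                           ("page experts over PCIe", pcie_bw)]:
              print(f"{name}: ~{bw / active_bytes:.0f} tok/s")

      And the paging path still has to read the weights out of system RAM first, so the PCIe hop is pure overhead on top of the CPU path.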
