LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That slows it down, but I'm not finding it unusable, and it seems to have some advantages - still, I doubt I'm going to run it this way.
Do you know if, when I run it with all layers on the GPU, it's doing what was described earlier - paging in an expert every time the routed expert changes? Each expert is only 5.1B parameters.
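A back-of-envelope sketch of what paging a whole expert would cost, using the 5.1B figure from the question. The quantization level and bus bandwidth below are my assumptions, not anything stated in the thread:

```python
# Cost of paging one 5.1B-parameter expert over PCIe.
# Assumptions (mine): 4-bit quantization (~0.5 bytes/param)
# and ~25 GB/s effective PCIe 4.0 x16 bandwidth.
params_per_expert = 5.1e9
bytes_per_param = 0.5        # 4-bit quant
pcie_bandwidth = 25e9        # bytes/sec, effective

expert_bytes = params_per_expert * bytes_per_param  # ~2.55 GB
transfer_s = expert_bytes / pcie_bandwidth          # ~0.1 s per swap

print(f"~{expert_bytes / 1e9:.2f} GB per expert, "
      f"~{transfer_s * 1000:.0f} ms to page one in")
```

Even at ~2.5 GB per expert, paging on every routing change would add on the order of 100 ms per swap, which is why runtimes generally try to keep all experts resident rather than swap them per token.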
You're right that I was confused about that.
FWIW, that's an 80GB model and you also need room for the KV cache. You'd need roughly 96GB to run it entirely on the GPU.