Comment by Confiks
10 hours ago
> The same Gemma 4 MoE model (Q4)
As you have so much RAM, I would suggest running Q8_0 directly. It's not slower (except perhaps for the initial model load), and might even be faster, while being almost identical in quality to the original model.
And just to be sure: you're running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spat out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.
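For what it's worth, this is roughly how you can install mlx-lm from main and test it; the model repo id below is just a placeholder, not the exact quant I used:

```shell
# Install mlx-lm from the main branch (Apple Silicon only)
pip install -U "git+https://github.com/ml-explore/mlx-lm.git"

# Generate with a quantized model from the Hugging Face hub;
# substitute the repo id (or local path) of the quant you want to test
mlx_lm.generate --model <hf-repo-or-local-path> \
  --prompt "Explain KV caching in one paragraph." \
  --max-tokens 256
```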
I unfortunately only have 16 GiB of RAM on a MacBook M1, but I just tried running the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB of RAM, CPU only, and it works surprisingly well: tokens/s much faster than I can read the output. The prompt cache is also very useful for quickly inserting a large system prompt or a file to datamine, although there are probably better ways to do that than manually through a script.
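For reference, the prompt-cache workflow I mean looks roughly like this with llama.cpp's CLI (the flags are llama.cpp's; model and file names are placeholders):

```shell
# First run: evaluate the large system prompt once and save the KV state
llama-cli -m model-Q8_0.gguf --prompt-cache cache.bin -f big_prompt.txt

# Later runs: a prompt that starts with the same cached prefix skips
# re-evaluating the long context; --prompt-cache-ro keeps the cache file intact
llama-cli -m model-Q8_0.gguf --prompt-cache cache.bin --prompt-cache-ro \
  -f big_prompt_plus_question.txt
```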
> That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch
Unfortunately I've had zero success running Gemma with the mlx-lm main branch. Can you point me to the right way to do it? I have zero experience with mlx-lm.
Gemma 4 is not supported by the MLX engine yet.
> As you have so much RAM I would suggest running Q8_0 directly
On the 48GB Mac, absolutely. The 24GB one cannot run Q8, hence the comparison.
> And just to be sure: you're running the MLX version, right?
Nah, not yet. I have only tested in LM Studio, and they don't have recommended MLX versions yet.
> but has since been fixed on the main branch
That's good to know, I will play around with it.