Comment by beastman82

1 day ago

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

14 comments

beastman82

My experience aligns with this. I'm running gemma4 31B on a 4090 through llama.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning since it is faster, but Gemma4's generated code is much higher quality on the first try (Rust, C++ and C#), so it needs fewer revisions to reach a level I'm comfortable merging.

beastman82 1 day ago
I second unsloth models. I'm using them over Blackwell-oriented NVFP4 models because, empirically, they give the best quality and performance.
- kroaton 20 hours ago
  
  NVFP4 will be better if the model provider actually post-trained properly after quantizing.
  

nozzlegear 1 day ago

I can't get Gemma4 to actually finish a turn properly; it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.

anon373839 8 hours ago

Same problem with Gemma 4 + oMLX + OpenCode. The thinking and tool calling seem to be parsed fine in other clients such as Open WebUI. This really shouldn't even matter because the client isn't responsible for parsing the output, but it's happening anyway.
acrispino 17 hours ago

Possibly a problem with the chat template:
https://huggingface.co/google/gemma-4-31B-it/discussions/118
clusterhacks 1 day ago

Huh. Same problem, and I run with llama.cpp. In my case, Gemma4-31B (4-bit quant though) will just stop sometimes.

accrual 1 day ago

Nice. I flip-flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plans or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.

nok22kon 1 day ago
You can probably also run Gemma4 26B on your card at 4-bit. It's a world of difference compared with 12B.
- zingar 1 day ago
  
  Where does “big model highly quantized” start getting worse than “smaller model less quantized”? Is there a general formula or is it just trial and error?
  
boppo1 9 hours ago

Have you tried qwen 27b q4_K_XL? It's a little bigger than what fits in the 4080's VRAM, but not by much.
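
For the "does it fit on my card" part of the questions above, a back-of-envelope estimate of weight memory can help. This is only a rough sketch: the function names, the bits-per-weight figure you pass in, and the 2 GiB overhead allowance are all assumptions for illustration, not numbers from any specific runtime. Real usage also depends on KV cache size, context length, and the backend, and this says nothing about where quality starts to degrade.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB.

    params_billion: parameter count in billions (e.g. 27 for a 27B model).
    bits_per_weight: effective bits per weight of the quant. This is an
    assumed input; real quant formats mix precisions, so look up the
    actual file size when it matters.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """Rough check: weights plus a hypothetical fixed overhead vs. VRAM.

    overhead_gib is a placeholder for KV cache and runtime buffers; it
    grows with context length, so treat the default as a guess.
    """
    return weight_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    # Hypothetical comparison: a larger model at ~4.5 bits vs. a smaller
    # one at ~6.5 bits, on a 16 GiB card.
    for name, params, bits in [("27B @ ~4.5 bpw", 27, 4.5),
                               ("9B @ ~6.5 bpw", 9, 6.5)]:
        size = weight_gib(params, bits)
        print(f"{name}: {size:.1f} GiB weights, fits in 16 GiB: "
              f"{fits_in_vram(params, bits, 16)}")
```

If the estimate lands slightly over your VRAM, runtimes that support partial CPU offload can still run the model, just slower.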