Comment by msp26
21 hours ago
Google is single-handedly carrying Western open source models. Gemma 3 27B is fantastic.
However, it's a little painful trying to fit the best possible version into 24GB of VRAM along with vision and, soon, this drafter. My build doesn't support any more GPUs, so I'd either need a second 4090 (overpriced) for best performance or have to replace it altogether.
You could keep the multimodal projector (the part that handles audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp. It then isn't GPU-accelerated, of course, but you save its VRAM.
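For reference, the invocation would look something like this (a sketch; the GGUF file names are placeholders for whatever build you actually use):

```sh
# --no-mmproj-offload keeps the multimodal projector in system RAM,
# freeing its VRAM for the main model's layers.
llama-server \
  -m gemma-3-27b-it-Q4_K_M.gguf \
  --mmproj mmproj-model-f16.gguf \
  --n-gpu-layers 99 \
  --no-mmproj-offload
```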
Interesting, I might try that, thanks!
Qwen is still better than Gemma, though. You can also tune it more for different tasks, which lets you trade off thinking and accuracy against inference speed.
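Qwen3, for instance, exposes a per-request thinking toggle through its chat template; a minimal sketch (the model name is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
messages = [{"role": "user", "content": "Summarize this paragraph."}]

# enable_thinking=False skips the reasoning trace for faster answers;
# set it to True when accuracy matters more than latency.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```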
Qwen is better at some things (code, in particular), but Gemma has better prose and better vision. At least, it feels that way to me.
Gemma is also just way faster. I don't want to wait 10 minutes to get a 5-10% better answer (and sometimes an actually worse one).
The best approach at the moment is to use your own model router and pick per task.
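A router can be as simple as a lookup from task type to endpoint; a toy sketch (all endpoints and task names are made-up placeholders):

```python
# Map task classes to local inference endpoints; point these at
# whatever servers you actually run.
ROUTES = {
    "code":   "http://localhost:8081/v1",  # e.g. a Qwen coder build
    "vision": "http://localhost:8082/v1",  # e.g. Gemma with its mmproj
    "chat":   "http://localhost:8083/v1",  # fast general-purpose model
}

def route(task: str) -> str:
    """Return the endpoint for a task, falling back to the chat model."""
    return ROUTES.get(task, ROUTES["chat"])

print(route("code"))  # -> http://localhost:8081/v1
```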
Genuine question: how do you tune it?
I thought "fine-tuning" meant training it on additional data to add additional facts / knowledge? I might be mistaking your use of the word "tune", though :)
You can fine-tune relatively easily in Unsloth Studio.
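The underlying unsloth Python library exposes the same workflow in code; a minimal sketch (the model name and hyperparameters are illustrative, not a recipe):

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model to keep VRAM usage down.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-27b-it-bnb-4bit",  # placeholder checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained,
# which is what makes the job fit on a single consumer GPU.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# From here, train with trl's SFTTrainer on your task-specific data.
```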
It’s a heck of a lot faster too.
Yes, I would just go with Qwen.