Comment by hydra-f

6 hours ago

Vision has become totally underappreciated, whereas I believe it brings important advantages to a model

Also, a big caveat in using Qwen models has always been their speech patterns. I do wonder how Google made the Gemma lineup so good at this

Let's hope Alibaba continues to open source its models

Agreed. Incidentally, in my testing, qwen models (qwen3.6-35b-a3b and earlier 3.5) are WAY better with vision than gemma4-26b-a4b. I would normally want to stick with gemma4 only (I use it for spam filtering), but it just doesn't cut it for vision work, and qwen models do.

  • That has been my experience as well.

    Qwen 3.5/3.6 are far better at vision. Even the 9B model beats Gemma 4 31B in my use case. They describe the scene more accurately and they focus on the important elements like a human would.

    Gemma 4 frequently misses important elements, doesn't understand what things are, and is very coy even if you ask for lots of detail. You have to give it hints like "hey what's that round thing on the left" to get half-decent answers.

    (Yes I did set the min-tokens correctly. I also tested bf16 and Q8 to make sure it wasn't a quant issue.)

    It's unfortunate because Gemma 4 is so so so much better at natural language interactions.

  • > qwen models (qwen3.6-35b-a3b and earlier 3.5) are WAY better with vision than gemma4-26b-a4b

    Can you give an example? And/or is there a benchmark specifically for this?