Comment by tredre3
5 hours ago
That has been my experience as well.
Qwen 3.5/3.6 are far better at vision. Even the 9B model beats Gemma 4 31B in my use case. They describe the scene more accurately and they focus on the important elements like a human would.
Gemma 4 frequently misses important elements, doesn't understand what things are, and is very coy even if you ask for lots of detail. You have to give it hints like "hey, what's that round thing on the left?" to get half-decent answers.
(Yes I did set the min-tokens correctly. I also tested bf16 and Q8 to make sure it wasn't a quant issue.)
It's unfortunate because Gemma 4 is so so so much better at natural language interactions.