Comment by throwdbaaway
6 days ago
Yours is the only benchmark that puts 35B A3B above 27B. Time for human judgement to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts, which then tripped up the 27B more when reasoning. This will also be reflected in the score when thinking is disabled, but we can sort of debug with the thinking traces.
I inspected manually and indeed the 27B is doing worse, but I believe it could be due to the exact GGUF in the ollama repository and/or with the need of adjusting the parameters. I'll try more stuff.
Isn’t llama.cpp’s implementation of Qwen 3.5 better, or am I misinformed?
There was a recent fix by ollama and I used it.