Comment by throwdbaaway

4 months ago

Yours is the only benchmark that puts 35B A3B above 27B. Time for human judgement to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts that trip up the 27B more when it reasons. That would also show up in the score when thinking is disabled, but the thinking traces at least give us a way to debug it.

antirez 4 months ago

I inspected the results manually and the 27B is indeed doing worse, but I believe it could be due to the exact GGUF in the ollama repository and/or the need to adjust the parameters. I'll try more stuff.

andhuman 4 months ago
Isn’t llama.cpp’s implementation of Qwen 3.5 better, or am I misinformed?
- antirez 4 months ago
  
  There was a recent fix by ollama and I used it.