Comment by antirez

6 days ago

My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. However, when reading these percentages, consider that the no-think setup is much faster and may be more practical in most situations.

    1. DeepSeek API -- 100%
    2. qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
    3. qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
    4. qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
    5. qwen3.5:27b-q8_0 (thinking) -- 75.3%

I expected the 27B dense model to score higher. Disclaimer: these numbers come from evaluations of one-shot replies; the models were not put in a context where they could iterate as agents.
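The scoring scheme described above (a judge grades each model's replies, with the baseline normalized to 100%) can be sketched roughly like this. This is a hypothetical reconstruction, not the actual harness: `judge_score` is a placeholder for a call to the judge model (Claude Opus in this case).

```python
# Sketch of an LLM-as-judge relative benchmark: each reply gets a 0-10
# grade from the judge, and each model's total is reported as a
# percentage of the baseline model's total.

def judge_score(problem: str, reply: str) -> float:
    """Hypothetical placeholder: ask the judge model to grade `reply`
    to `problem` on a 0-10 scale."""
    raise NotImplementedError  # wire this to your judge API

def relative_scores(problems, replies_by_model, baseline="deepseek",
                    grade=judge_score):
    """Return {model: percentage of baseline's total grade}."""
    totals = {
        model: sum(grade(p, r) for p, r in zip(problems, replies))
        for model, replies in replies_by_model.items()
    }
    base = totals[baseline]
    return {model: 100.0 * total / base for model, total in totals.items()}
```

With a setup like this, the baseline always scores exactly 100%, and any model whose replies the judge grades consistently lower lands proportionally below it.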

Yours is the only benchmark I've seen that puts the 35B A3B above the 27B. Time for human judgment to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts that tripped up the 27B more during reasoning. That would also be reflected in the score when thinking is disabled, but the thinking traces at least let us sort of debug it.

  • I inspected the outputs manually and indeed the 27B is doing worse, but I believe it could be due to the exact GGUF in the ollama repository and/or the need to adjust the parameters. I'll try more stuff.

Maybe a reductive question, but are there any thinking models that don't add much latency, relatively speaking?

  • The whole point of thinking is to throw more compute/tokens at a problem, so it will always add latency over non-thinking modes/models. Many models do support variable thinking levels or thinking-token budgets, though, so you can set them to low/minimal thinking if you want only a small increase in latency versus no thinking.
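  • As a rough illustration, the trade-off usually comes down to one knob in the request payload. The exact field names vary by provider (Anthropic's Messages API takes `thinking={"type": "enabled", "budget_tokens": N}`, and Ollama's `/api/chat` takes a boolean `think` for Qwen3-style models), so the `thinking_budget` field below is a hypothetical stand-in; check your API's docs for the real parameter.

    ```python
    # Hypothetical sketch of toggling/capping thinking in a chat request.
    # Field names ("think", "thinking_budget") are illustrative, not a
    # real provider's schema.

    def build_request(prompt: str, think_budget: int | None) -> dict:
        payload = {"messages": [{"role": "user", "content": prompt}]}
        if think_budget is None:
            payload["think"] = False  # no-think: lowest latency
        else:
            payload["think"] = True
            # Cap reasoning tokens: a small budget buys back most of the
            # latency while keeping some of the quality gain.
            payload["thinking_budget"] = think_budget
        return payload
    ```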