
Comment by camelmel

2 days ago

Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench Pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench Pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku, but Haiku is not good: it's worse than tiny open models you can run locally or via API at 10% of the cost.

Dave Citron here, from the MAI team. Thanks for the feedback, we're getting the model card updated to call out 5B active parameters (137B total).

On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2%, which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, such as Qwen 3.6 and Gemma 4.

  • Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.

    Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/

  • Hey Dave, I’d love to add your new model to the harness I’m going to open-source very soonish. I’m going to publish benchmarks on real-world tasks.

  • Qwen HAS to be a part of the discussion here, even though Microsoft is a US-based entity. Their 30b MoE models absolutely hit way above their weight when paired with the right harness program, and can be run on "Costco gaming computer" specs when configured correctly in llama.cpp.

    Sorry Trump Administration, but while the US has been downloading more ram by throwing data centers at everything and burning up everyone's power and water, China has come out with what's effectively a prototype edge compute capable AI model - regardless of how they built it. And arguably I can tokenmaxx on it just fine at around 30-40 tokens/sec.

    And also, ASICs are on the way. Imagine one of those with a heavy hitting model (MoE or otherwise, Qwen or otherwise) installed in a PCIe slot at 10k+ tokens/sec and 75 watts max (maximum wattage deliverable by the PCIe slot alone) for $300-400 USD each.

    https://taalas.com/the-path-to-ubiquitous-ai/

    ASIC demo here: https://chatjimmy.ai/

    Sorry/not sorry to rip this whole thing to shreds. But I'm sick and tired of these inefficient LLMs being produced that seemingly can only be offered by subscription from a data center, when I'm running a full AI stack right now (model and all) on my computer at home on a 750 watt max power supply. Microsoft really needs to get with the picture here and compete more with Qwen instead of just the US/EU entities.

    Sincerely, your neighbor down in Tacoma. https://www.youtube.com/watch?v=V9jlo4Ht2YA&t=229s

Qwen is definitely the model to beat as of mid-2026. I didn't benchmark with SWE-bench because my use cases are OpenClaw [1], but I found both Qwen 3.6 35B A3B and, more impressively, Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.

[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...

  • How does qwen compare to deepseek or kimi? I haven't spent much time with qwen, but I find deepseek to be mostly comparable to opus for my pet projects. Kimi k2.6 did a lot of stupid stuff and talked to itself a lot: "let me do X... Wait, X doesn't make sense because the user explicitly said Y".

    Deepseek seems to seek first to understand before going off.

    • Deepseek is too large for me to self-host on Spark. I was actually using Deepseek as my cloud backup and it performed well, but then I read the T&C, which don't give data protection guarantees as strong as Google's and Alibaba's. Kimi is again massive, its cloud-hosted APIs are fairly expensive by comparison, and it also has weak T&C, so I have only benched it, not tested it. In general I found that OpenClaw works better with reasoning turned off.

      I think there's possibly value in fine-tuning Qwen 3.5 on my OpenClaw turns log to see if performance improves. The one recent model I haven't tested yet is Nemotron 3 Super, which I might bench soon.

The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering its own models on Copilot; maybe it was part of their deal with OpenAI? Not sure.

  • Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.

    Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.

    Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.

    • > 98% smaller in terms of active parameters (since it's a mixture of experts model).

      I don’t think that’s right, this flash model is 5B active params. Qwen3.6-35B-A3B is 3B so 40% smaller.

    • I understand what you’re saying, but I am generally very careful when comparing models and their benchmarks; benchmarks often don’t really match “real world” quality.
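To make the size arithmetic in this subthread concrete, here is a minimal Python sketch. The parameter counts are the ones quoted above (137B total / 5B active for MAI-Code-1-Flash, 35B total / 3B active for Qwen3.6-35B-A3B); the helper name `percent_smaller` is purely illustrative. The last comparison is only a guess at where a figure near 98% could come from (Qwen's active count against MAI's total count), not something the original poster confirmed.

```python
def percent_smaller(small: float, large: float) -> float:
    """Return how much smaller `small` is than `large`, in percent."""
    return (1 - small / large) * 100


# Parameter counts in billions, as quoted in the thread.
MAI_TOTAL, MAI_ACTIVE = 137, 5
QWEN_TOTAL, QWEN_ACTIVE = 35, 3

# Like-for-like comparisons: total vs total, active vs active.
total_gap = percent_smaller(QWEN_TOTAL, MAI_TOTAL)
active_gap = percent_smaller(QWEN_ACTIVE, MAI_ACTIVE)

# Assumption: a mixed comparison (Qwen active vs MAI total) is one way
# to land near the "98%" figure; this is a guess, not a confirmed source.
mixed_gap = percent_smaller(QWEN_ACTIVE, MAI_TOTAL)

print(f"total params:             {total_gap:.1f}% smaller")
print(f"active params:            {active_gap:.1f}% smaller")
print(f"Qwen active vs MAI total: {mixed_gap:.1f}% smaller")
```

On a like-for-like basis this gives roughly 74% smaller in total parameters and 40% smaller in active parameters, which matches the correction in the reply.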


So what other models use less than half of Haiku's tokens while providing higher success rate?

  • Why is Haiku the benchmark, though? With code generation, don't we primarily care about the quality of the code, not the speed or efficiency at which it's generated?

    • You would be surprised how much code haiku writes behind the scenes. With the whole 'plan w/ opus, spawn subagents w/ haiku' that cc does. And you'd be surprised how useful the small models can be under some guidance / hand holding. You can daily-drive gpt5-mini and still find it useful. They're not as good as the big ones, obviously, and can't handle a project start-to-finish on their own, but given a well-scoped task, they'll do it just fine.

    • I'm not sure I follow, but I'll give you a very fresh example.

      I was implementing a re-print functionality in my warehouse management system.

      It took Opus 4.8 high 24m1s and 87k tokens. Took Haiku 6m30s and 41k tokens.

      After that time I had to provide (minor) adjustments to both. But Haiku allowed me to iterate faster. Code quality for that somewhat trivial use case was similar.

      Actually, I would even say that Opus provided a subpar solution: instead of fixing an issue where the carrier label PDF wasn't saved as the state machine progressed to the latest step, it went for a much more complex solution of regenerating the labels from scratch. That is also wrong, as it was de facto booking the carriers twice for the same order.

      Haiku simply added another field on the terminal state that carried the already generated urls.

      I don't think it's a good idea to default to highest effort/bigger model without taking into account the time it takes and the task complexity.

      IMHO we should experiment rather than assume that whatever the rest of the community does is best practice.
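For anyone who wants to put numbers on the iteration-speed point, here is a small Python sketch using only the figures from the anecdote above (24m1s / 87k tokens for Opus, 6m30s / 41k tokens for Haiku). It is a single data point, and it assumes the token counts are totals for the whole task; the anecdote does not say whether they include input tokens. The helper name `to_seconds` is illustrative.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds wall-clock reading to seconds."""
    return minutes * 60 + seconds


# Figures quoted in the anecdote (one task, one run each).
opus_secs = to_seconds(24, 1)
haiku_secs = to_seconds(6, 30)
opus_tokens, haiku_tokens = 87_000, 41_000

# How many times longer / more tokens the Opus run took.
time_ratio = opus_secs / haiku_secs
token_ratio = opus_tokens / haiku_tokens

# Rough throughput over the whole task (tokens per wall-clock second).
opus_rate = opus_tokens / opus_secs
haiku_rate = haiku_tokens / haiku_secs

print(f"wall-clock: Opus took {time_ratio:.1f}x as long as Haiku")
print(f"tokens:     Opus used {token_ratio:.1f}x as many as Haiku")
print(f"throughput: Opus {opus_rate:.0f} tok/s, Haiku {haiku_rate:.0f} tok/s")
```

The time gap (about 3.7x) is larger than the token gap (about 2.1x), so on this one task the difference comes from both fewer tokens and a higher tokens-per-second rate.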


While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:

  Qwen3.6-35B-A3B   vs   Claude Haiku 4.5
    reasoning mode · AA Intelligence Index v4.0
  
  46.0 ┤   ↖ better — cheaper · smarter · faster
       │
       │
  44.0 ┤     ╭─────╮
       │     │  ●  │ Qwen3.6-35B-A3B
       │     ╰─────╯
  42.0 ┤
       │
       │
  40.0 ┤
       │
       │
  38.0 ┤                                       ╭───╮
       │                      Claude Haiku 4.5 │ ○ │
       │                                       ╰───╯
  36.0 ┤
       └┬─────────┬─────────┬─────────┬─────────┬────────┬
        $200    $300      $400      $500      $600    $700
  
    x → cost to run the index (USD)        lower is better
    y → AA intelligence index              higher is better
  
    bubble area = output speed (tokens / sec)
          ╭─────╮                  ╭───╮
          │  ●  │ Qwen ~196 t/s    │ ○ │ Haiku ~93 t/s
          ╰─────╯                  ╰───╯
  
    ┌─────────────────────┬──────────┬──────────┬───────────┐
    │ model               │ AA index │ run cost │ out speed │
    ├─────────────────────┼──────────┼──────────┼───────────┤
    │ Qwen3.6-35B-A3B    ●│   43.5   │   $280   │  196 t/s  │
    │ Claude Haiku 4.5   ○│   37.1   │   $620   │   93 t/s  │
    └─────────────────────┴──────────┴──────────┴───────────┘


    COST PER TOKEN   ≠   COST PER TASK  
    output tokens per index run:
       Haiku 4.5    87.3M   (79.3M reasoning + 8.0M answer)
       Qwen3.6     143.2M   (131.7M reasoning + 11.5M answer)
       → Qwen emits 1.64× more output
  
    ── output speed (tokens / sec) ──────────  raw rate · higher = faster
       Qwen3.6     100%   ~196 t/s
       Haiku 4.5   ~47%   ~93 t/s
                                                  → Qwen ~2.1× faster per token
  
          ╎   1.64× more tokens  <  2.1× faster rate
          ▼
  
    ── solution speed (per finished answer) ──  higher = faster
       Qwen3.6     100%
       Haiku 4.5   ~78%
                                                  → Qwen ~1.3× FASTER to a solution
  
    SCORECARD
                            intelligence    cost / task     speed to solution
     Qwen3.6-35B-A3B        43.5            $280            ~1.3× faster 
     Claude Haiku 4.5       37.1            $620            (slower)
  
     → Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
       the raw-speed edge (2.1×), so Qwen stays ahead per task.
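The derived ratios in the chart can be re-checked with a short Python sketch. All inputs are the Artificial Analysis figures quoted above; the `IndexRun` class and its field names are illustrative, and "speed to solution" here assumes time scales as output tokens divided by output rate, ignoring time to first token, input processing, and any parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRun:
    """Figures for one model's full run of the index (hypothetical container)."""
    name: str
    output_tokens_m: float  # output tokens per index run, in millions
    tokens_per_sec: float   # output speed
    run_cost_usd: float     # cost to run the index


haiku = IndexRun("Claude Haiku 4.5", 87.3, 93.0, 620.0)
qwen = IndexRun("Qwen3.6-35B-A3B", 143.2, 196.0, 280.0)

# Qwen emits more output tokens for the same work...
token_ratio = qwen.output_tokens_m / haiku.output_tokens_m
# ...but generates each token faster.
speed_ratio = qwen.tokens_per_sec / haiku.tokens_per_sec
# Net speed to a finished answer: rate advantage divided by token overhead.
solution_speedup = speed_ratio / token_ratio
# Cost for the same set of tasks, Haiku relative to Qwen.
cost_ratio = haiku.run_cost_usd / qwen.run_cost_usd

print(f"output tokens:     Qwen {token_ratio:.2f}x Haiku")
print(f"output speed:      Qwen {speed_ratio:.2f}x Haiku")
print(f"speed to solution: Qwen {solution_speedup:.2f}x Haiku")
print(f"run cost:          Haiku {cost_ratio:.2f}x Qwen")
```

Under those assumptions the numbers line up with the chart: about 1.64x more tokens, about 2.1x the output rate, about 1.3x faster to a solution, and Haiku costing a bit over 2x as much for the same run.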

  • How did you get that nicely formatted graph and table in your post?!

    • > Text after a blank line that is indented by two or more spaces is formatted as code.

      https://news.ycombinator.com/formatdoc

        crimes ↑
               │
         10.0  ┤                                           ● Airport burger
               │                                      ╭──────────────╮
          8.0  ┤                                      │  theft arc   │
               │                                      ╰──────────────╯
          6.0  ┤                         ● Five Guys
               │
          4.0  ┤              ● Food truck burger
               │
          2.0  ┤      ● McBurger
               │
          0.0  ┤ ● Homemade burger
               │
               └───────┬─────────┬─────────┬─────────┬─────────→ price
                      $2        $8        $14       $22       $38
      
        ┌────────────────────┬────────┬──────────────┬────────────────────┐
        │ burger             │ price  │ crime index  │ expected behavior  │
        ├────────────────────┼────────┼──────────────┼────────────────────┤
        │ Homemade burger    │   $2   │          0.0 │ law-abiding citizen│
        │ McBurger           │   $6   │          1.4 │ steals extra napkin│
        │ Food truck burger  │  $11   │          3.1 │ lies about hunger  │
        │ Five Guys          │  $18   │          6.2 │ financial crime    │
        │ Airport burger     │  $34   │          9.7 │ enters villain arc │
        └────────────────────┴────────┴──────────────┴────────────────────┘
      
        conclusion: burger inflation is a gateway condiment
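If you want to prepare a chart like these before pasting, the quoted rule (a blank line, then text indented by two or more spaces) can be applied mechanically. Below is a minimal Python sketch; `as_hn_code` is a made-up helper name, and padding the empty lines inside the block is my own assumption for keeping the block contiguous, not something the formatting doc states.

```python
def as_hn_code(text: str, indent: int = 2) -> str:
    """Prefix a blank line and indent every line so the text reads as code."""
    if indent < 2:
        raise ValueError("the quoted rule needs two or more spaces of indent")
    pad = " " * indent
    # Assumption: empty lines inside the block are padded as well, so the
    # whole chart stays in one indented run.
    body = "\n".join(pad + line for line in text.splitlines())
    # Leading newline supplies the blank line required before the block.
    return "\n" + body


chart = "price | crime index\n   $2 | 0.0\n  $34 | 9.7"
print(as_hn_code(chart))
```

Paste the result after your normal comment text; the leading blank line separates it from the prose.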
