Comment by stevenhuang

18 hours ago

Current local models already compete.

A Qwen3.6-35B-A3B (or whatever its full name is) running on a 3090 can, at the very least, with very little fine-tuning, compete with Haiku and blow away GPT-4.1 (aka the cheap models).

It might keep up with Sonnet 4.5 with some tinkering.

But long story short: local seems to offer better performance and similar quality, trailing the cloud models by a year or so. It's the same way you can self-host faster/easier/cheaper than cloud hosting, if you're okay with the downsides.

I'm returning my 3090 soon for an R9700 after some more basic benchmarking, since the extra VRAM should improve my results further.

  • > It might keep up with Sonnet 4.5 with some tinkering.

    I would love to see that. I've been using Qwen3.6 35B and the dense 27B, and they're both too slow, with not-so-great results on agentic coding tasks. It's OK, but not impressive. I had better luck with the BF16 and Q8 than with the Q4 from unsloth (I really love what unsloth is doing in this space). Another problem I had with Qwen, which I never encountered with Sonnet: even the BF16 gets stuck and needs a "continue task" prompt from time to time, and the lower quants are even worse in that regard.

    If you get some interesting results, I would love to read about it!

    • You don't mention your runtime, hardware, or harness, all of which are critical. The 35B-A3B model should be pretty fast; you do need a decent setup, but nothing too fancy. I'm using the Q8_XL from unsloth with llama.cpp and opencode, and it's pretty awesome. I find that opencode drives the model best: it very rarely gets stuck, even with a ton of tool calls. I agree it's comparable to Sonnet 4.5 for most tasks. You could also try the Gemma 4 models, which are faster but not as good at coding.
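
      For anyone wanting to reproduce a setup like this, here's a minimal sketch of serving an unsloth GGUF quant with llama.cpp's built-in server (the model filename, context size, and port are my assumptions, not from this thread):

      ```shell
      # Serve a local GGUF quant via llama.cpp's OpenAI-compatible server.
      # The model filename is an assumption: substitute whichever unsloth quant you downloaded.
      #   -ngl 99  : offload all layers to the GPU
      #   -c 32768 : large context window, which agentic tool-call loops need
      llama-server -m Qwen3.6-35B-A3B-Q8_K_XL.gguf -ngl 99 -c 32768 --port 8080

      # Then point opencode (or any OpenAI-compatible client) at http://localhost:8080/v1
      ```

      If generation is slow or you run out of VRAM, lowering `-ngl` to keep some layers on the CPU is the usual first knob to turn.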