I was getting dangerously close to my weekly Claude Code limit last night so I had Claude set up Qwen3.6 with llama.cpp and OpenCode. Honestly it's a great (free!) alternative to Claude Code--certainly more than good enough for a lot of smaller less complex tasks. I'm excited to try this new version. The fact that open-source models are so close to the frontier is very impressive.
Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance, like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before, but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My newly installed opencode also has iTerm sitting at 100% CPU when idle. :/
Qwen Max are usually closed, unfortunately.
The non-hallucination rate in AA-omniscience is SOTA, better than Opus 4.7, Gemini 3.1 Pro and GPT5.5! Congrats to the team
> The non-hallucination rate in AA-omniscience is SOTA
Note that a perfect "non-hallucination rate" score is rather meaningless as the test itself contains human hallucinations.
It just means the model aligns with the semi-true, semi-false beliefs of the group that made the test.
referencing this:
https://artificialanalysis.ai/evaluations/omniscience?models...
(had to add it to the chart, wasn't displayed by default. is it the lowest rate in the dataset or no?)
Truly incredible! Very impressed by their progress. I wonder how many of their own chips they used for training.
wonder at which level there's a capability state transition? 5%? 1%?
As they start to release more proprietary models, I so wish that they partnered with one of the major US hyperscalers to allow using these models through something US-domiciled.
Totally understand why it may not be reasonable or in their best interest (and that the US is _absolutely_ not doing the same reflexively). But it would be lovely to be able to try these out on production workloads in earnest.
Unless US hyperscalers do the same in reverse, I hope the status quo stays as it is. Either people are happy to share, and the sharing should happen both ways, or US hyperscalers can keep isolating themselves as they've done so far.
I do hope the U.S. hyperscalers do the same as well.
In an ideal world U.S. residents would use Chinese AI models and Chinese residents would use U.S. AI models.
Governments in both countries are collecting data for nefarious reasons. But the Chinese government has far less influence on a U.S. resident and vice versa.
We are all better off if our data is collected by a government halfway across the world instead of our own governments which hold incredible amounts of power over us.
fireworks hosts Qwen 3.6 Plus, they might also get Qwen 3.7 Plus.
I'm more interested in hearing specific reasons why one wouldn't use a Chinese company. Unless you're thinking Alibaba is going to ship chat logs to some government ministry that will then dole out proprietary information to new competitors (which doesn't seem logistically feasible), or you run a human rights organization, it feels a bit like FUD.
All this data is accessible to national security agencies; this is true in every country in the world.
China has more integration between intelligence and industry than many Western countries, and it does present a higher risk of unwanted “tech transfer” to industry than running on Oracle or Google or MS or Amazon does in the US.
DHS has long staffed full-time agents in California to deal with foreign IP exfiltration. Using Qwen is like fast/easy mode for IP exfiltration: why make anyone get a job in your Palo Alto office when you can just send it to them in Hangzhou?
Upshot: if you have something proprietary you’re working on, I would generally advise not sending it directly to Alibaba.
> Unless you're thinking Alibaba is going to ship chat logs to some government ministry
This made me think of a Seinfeld episode: "I didn't know it was possible not to know that."
>Unless you're thinking Alibaba is going to ship chat logs to some government ministry that will then dole out proprietary information to new competitors (which doesn't seem logistically feasible)
That's exactly the fear, and why would it not be logistically feasible? The threat is definitely a bit overhyped, but China has a longstanding track record of aggressive corporate espionage.
… building and selling a product to US companies that sends company-internal data to Chinese AI providers is not a particularly good way to get people to buy it.
Even if they weren’t individually worried about their proprietary data being shared with Chinese domestic competitors or with government… their audit / security programs likely wouldn’t allow it for a _huge_ range of types of data.
Because my CEO thinks China scary big hacker guys over there
US hyperscalers, all of them, are financially invested in the US AI labs and have the incentives to keep the status quo.
ChatLLM supports Qwen. Do you consider that US-safe?
Is this one of those ones where they'll drop the huggingface release a week later? Or do we know for sure that this is staying proprietary?
Someone correct me if I'm wrong, but I think the Max models are usually non-open.
The plus and max models have never been open as far as I know.
Looking forward to more open weight releases from Qwen, especially 122B and 397B.
Yeah, that ~60-150b range is such a sweet spot for current 'prosumer' hardware. I'd love to see something like a 120b-a14b or thereabouts.
I have a 128G mac studio and even 397B was a happy surprise to me due to its high quantization resilience.
I've created a 2.54BPW quant that fits on my hardware with 128k context, 20 tps tg and 200 tps pp, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...
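As a rough sanity check on the numbers above: weight size is approximately parameter count times bits per weight, divided by 8. The sketch below assumes the nominal 397B parameter count from the model name and ignores KV cache and runtime overhead. `weight_size_gib` is a hypothetical helper for illustration, not part of any tool mentioned in this thread.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every parameter is stored at the same average
    bits-per-weight; KV cache and runtime buffers are not included.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Nominal 397B parameters at the 2.54 BPW quoted above.
    print(f"2.54 BPW: {weight_size_gib(397e9, 2.54):.1f} GiB")
    # Same model at 16 bits per weight, for comparison.
    print(f"16 BPW:   {weight_size_gib(397e9, 16):.1f} GiB")
```

Under these assumptions the 2.54 BPW quant comes to roughly 117 GiB of weights, so headroom for context on a 128 GB machine would be limited.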
What’s the price point for getting into that sweet spot?
I’m on an M1 Max with 32GB VRAM, so I’m looking forward to the 27B or 35B-A3B models. Is dropping $5k for an RTX 6000 or a DGX Spark really the best option?
I'm more excited for Qwen3.7 9B and 72B; these are usually so good for their size.
I am still waiting for Qwen Image-Edit 2.0 open weights.
Ouch. I'm just getting into tinkering with these things - mine is running on a vanilla gaming desktop with a 12gb 3060 and 32gb of ram. Even going above Qwen 9B risks completely locking up the machine.
These are very good numbers. I still don’t get why they don’t compare against latest competitor versions in these posts, it’s not like we’re all not going to notice.
I find it forgivable if it's within a minor version bump. (NB that x.5 is now a de facto major-version bump for LLMs for whatever reason.)
Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.
I think the argument is that they’re trying to suggest that they’re close to N months from SOTA.
Realistically I assume they hope readers don’t notice the fine details.
The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.
> Realistically I assume they hope readers don’t notice the fine details.
The pool of people reading such articles while ignoring such details can't be big.
I think it's part of the expectation setting (with a side of "we did our distillation/eval harness on a specific model").
If they say it's 4.7-comparable, it anchors that in your head as the model to evaluate against.
Honestly, the initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs at the same level as that, I'm totally willing to switch.
4.6 was an awful experience the month I used it right after launch: it didn't ask anything, just made assumptions and went on its merry way. 4.5 and 4.7 don't do that for me, but 4.7 eats my quota for breakfast, so I've been avoiding it because I like to have it for more than an hour a day.
This puzzles me too. I want to know.
Marketing.
The pattern I trust most is adding a small verification artifact after every external action. Agents usually fail from silent state drift faster than from lack of reasoning depth.
Can you go into more depth about this?
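One possible reading of the "verification artifact" idea, as a minimal sketch: after each external action, re-read the state the action was supposed to change and record a small receipt of what was actually observed. Everything here (`Receipt`, `write_file_verified`, the receipt fields) is hypothetical and only illustrates the pattern; it is not the original commenter's implementation.

```python
import hashlib
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Receipt:
    """Small record written after an external action (hypothetical format)."""
    action: str
    target: str
    expected_sha256: str
    observed_sha256: str

    @property
    def ok(self) -> bool:
        # The action is verified only if the observed state matches the intent.
        return self.expected_sha256 == self.observed_sha256


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_file_verified(path: Path, content: str, log: list) -> Receipt:
    """Write a file, then re-read it from disk and log what was observed."""
    data = content.encode("utf-8")
    path.write_bytes(data)
    observed = path.read_bytes()  # read back instead of trusting the write
    receipt = Receipt("write_file", str(path), sha256_hex(data), sha256_hex(observed))
    log.append(asdict(receipt))
    if not receipt.ok:
        # Fail loudly so the agent cannot continue on drifted state.
        raise RuntimeError(f"state drift detected for {path}")
    return receipt


if __name__ == "__main__":
    receipts: list = []
    with tempfile.TemporaryDirectory() as tmp:
        result = write_file_verified(Path(tmp) / "notes.txt", "hello\n", receipts)
        print(result.ok, len(receipts))
```

The same shape applies to other external actions (a git push, an API call): perform the action, query the resulting state independently, and keep the receipt so later steps can check it instead of assuming success.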
Qwen really hits the sweet spot: it's cheap, fast, and actually good.
Any reports from people using their coding agent(s)?
I'm running Qwen 3.6 27B Q5_K_M GGUF on a Tesla P40 with koboldcpp, using pi.dev as the harness, and I gotta say I am impressed. Took some setup and configuring, but I already have some code it wrote committed and pushed. It can be slow on my hardware at >50k tokens, but given that I bought this one P40 for like $150 back when the LLM trend started, I can't complain. (I have a second one too, but I couldn't physically fit the card in my server, unfortunately.)
The setup I had to do was important: I had to compile koboldcpp with a few special params for my hardware, and I mostly just had Claude figure it out. I don't remember everything I did now, but at first it was very slow and would often stop mid-task; it seems it was mostly a parsing issue. It made the model seem broken/dumb, but once I had all that settled I was actually able to use this the way I use Claude Code. Disclaimer: I am pretty explicit with requirements. I imagine this fails more when you leave it to figure things out on its own, but for my flow it's pretty rad.
Currently setting it up as an automated agent now to pull Trello cards, create PRs for them, and move the card to be reviewed.
Command I am using to run:
  python koboldcpp.py \
    --port 61514 --quiet --multiuser --gpulayers 999 --contextsize 262144 --quantkv 2 \
    --usecublas normal --threads 4 --jinja --jinja_tools --jinja_kwargs '{"enable_thinking":true, "preserve_thinking":false}' \
    --skiplauncher --model /data/models/Qwen3.6-27B-Q5_K_M.gguf --smartcache 5
Qwen recommends preserve_thinking: true for agentic/coding workloads.
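For anyone wanting to flip that setting without hand-editing the quoted JSON, here is a small sketch that rebuilds the command above as an argument list. The flags and values are copied verbatim from the parent comment and have not been checked against the koboldcpp documentation; `build_koboldcpp_command` is a hypothetical helper, and whether `preserve_thinking` takes effect depends on the model's chat template.

```python
import json
import shlex


def build_koboldcpp_command(model_path: str, preserve_thinking: bool = True) -> list[str]:
    """Rebuild the launch command quoted above, with preserve_thinking toggled.

    Assumption: flag names and values are taken as-is from the comment above.
    """
    # json.dumps produces valid JSON (true/false), avoiding shell-quoting mistakes.
    jinja_kwargs = json.dumps(
        {"enable_thinking": True, "preserve_thinking": preserve_thinking}
    )
    return [
        "python", "koboldcpp.py",
        "--port", "61514", "--quiet", "--multiuser",
        "--gpulayers", "999", "--contextsize", "262144", "--quantkv", "2",
        "--usecublas", "normal", "--threads", "4",
        "--jinja", "--jinja_tools", "--jinja_kwargs", jinja_kwargs,
        "--skiplauncher", "--model", model_path, "--smartcache", "5",
    ]


if __name__ == "__main__":
    cmd = build_koboldcpp_command("/data/models/Qwen3.6-27B-Q5_K_M.gguf")
    # shlex.join quotes the JSON argument so it can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the server without going through a shell at all, which sidesteps the quoting of the JSON argument entirely.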
I'm using the pi-mono coding agent (open source, free) without any extensions and very simple prompts. The 3.6 27B model (BF16, 250k context) uses 67GB VRAM on an RTX PRO 9000.
It's very capable on almost any coding task I've thrown at it, and very good for easy-to-medium hard scripts, new code bases.
It struggles on some complex tasks in larger code bases, e.g. when using it to debug and fix bugs in llama.cpp, it gets close to working code but often introduces errors. For such tasks it's still very useful as a search/explore tool and for drafting fixes.
The tokenomics and value for capability, context and latency look like they could deliver a super competitive offer. What would it take for you to switch?
It is super strange that in all of the last (3?) releases they keep comparing against older models such as Opus-4.6.
Some of it’s probably timing. Some of it is wanting to look good. That said, I just went to the claw-eval site, and neither 4.7 nor 5.5 from oAI are listed on the benchmarks. So there’s also just the time from others to get benchmarking done and published.
Because these can’t compete with the SoTA but they’re close.
Opus-4.6 was probably the best model so far before it got nerfed. 4.7 is nowhere near the experience I had. In fact, I stopped using it completely because more often than not its output is just dumber than local models.
Same here. Can't stand 4.7.
Any info on pricing and latency?
Does anyone have experience with the Alibaba Cloud Model Studio that serves these qwen models?
I can't bring myself to use any model that trains or sends telemetry back to my country's primary competitor/adversary. I don't care how much money is saved.
That is understandable. Just don't do it. No need to announce it.
As somebody in Europe, uh, that doesn't leave many options.
This is the current European modus operandi: virtue signal and cry about tech that other countries produce, pass local laws that limit its use in their countries even though they have no viable local alternatives, brag amongst themselves about decoupling from US and Chinese tech, and then look on wistfully as the rest of the world moves on without a single fuck given.
Europe's sense of superiority and actual global importance/relevance is assbackwards.
Can anyone check its knowledge base for me? I’m honestly not able to run it and the Qwen models I can run censor information critical towards the Chinese government.
Tiananmen Square is the first place to start.
> I’m honestly not able to run it
What do you mean? This is not self hosted, it's closed source. And any website that targets China or is hosted in China will probably censor Tiananmen Square.
There is no reason why they couldn't license the model to Friendli/Fireworks/etc and have it hosted in the US to alleviate this concern.