Qwen3-Coder-Next

8 hours ago (qwen.ai)

This GGUF is 48.4GB - https://huggingface.co/Qwen/Qwen3-Coder-Next-GGUF/tree/main/... - which should be usable on higher end laptops.

I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.

Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next

  • We need a new word, not "local model" but "my own computers model" CapEx based

    This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.

    That's not my definition of local. Mine is "local network". so call it the "LAN model" until we come up with something better. "Self-host" exists but this usually means more "open-weights" as opposed to clamping the performance of the model.

    It should be defined as ~sub-$10k, using Steve Jobs megapenny unit.

    Essentially classify things as how many megapennies of spend a machine is that won't OOM on it.

    That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that's at most single digit thousands of dollars. And if I was feeling fancy, could potentially fine-tune on the days scale.

    A modern 5090 build-out with a threadripper, nvme, 256GB RAM, this will run you about 10k +/- 1k. The MLX route is about $6000 out the door after tax (m3-ultra 60 core with 256GB).

    Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.

    What might be needed is some standardized eval benchmark against standardized hardware classes with basic real world tasks like toolcalling, code generation, and document procesing. There's plenty of "good enough" models out there for a large category of every day tasks, now I want to find out what runs the best

    Take a gen6 thinkpad P14s/macbook pro and a 5090/mac studio, run the benchmark and then we can say something like "time-to-first-token/token-per-second/memory-used/total-time-of-test" and rate this as independent from how accurate the model was.

    • You can run plenty of models on a $10K machine or even a lot less than that, it all depends how much you want to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more on memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.

      2 replies →

    • For context on what cloud API costs look like when running coding agents:

      With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.

      At 1K tasks/day that's ~$1.5K-3K/month in API spend.

      The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

      Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.

    • I don't even need "open weights" to run on hardware I own.

      I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.

      I do not want my career to become dependent upon Anthropic.

      Honestly, the best thing for "open" might be for us to build open pipes and services and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc.

      I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.

      I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.

      We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.

      5 replies →

    • OOM is a pretty terrible benchmark too, though. You can build a DDR4 machine that "technically" loads 256gb models for maybe $1000 used, but then you've got to account for the compute aspect and that's constrained by a number of different variables. A super-sparse model might run great on that DDR4 machine, whereas a 32b model would cause it to chug.

      There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions are what leads to people impulse buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.

      2 replies →

  • I run Qwen3-Coder-30B-A3B-Instruct gguf on a VM with 13gb RAM and a 6gb RTX 2060 mobile GPU passed through to it with ik_llama, and I would describe it as usable, at least. It's running on an old (5 years, maybe more) Razer Blade laptop that has a broken display and 16gb RAM.

    I use opencode and have done a few toy projects and little changes in small repositories and can get pretty speedy and stable experience up to a 64k context.

    It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller project, scaffolding, basic bug fixes, extra UI tweaks etc.

    I don't think "usable" a binary thing though. I know you write lot about this, but it'd be interesting to understand what you're asking the local models to do, and what is it about what they do that you consider unusable on a relative monster of a laptop?

    • I've had usable results with qwen3:30b, for what I was doing. There's definitely a knack to breaking the problem down enough for it.

      What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.

    • Honestly I've been completely spoiled by Claude Code and Codex CLI against hosted models.

      I'm hoping for an experience where I can tell my computer to do a thing - write a code, check for logged errors, find something in a bunch of files - and I get an answer a few moments later.

      Setting a task and then coming back to see if it worked an hour later is too much friction for me!

      1 reply →

  • > I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful

    I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.

    I'm wondering if maybe one could crowdsource chat logs for GPT-OSS-120b running with Codex, then seed another post-training run to fine-tune the 20b variant with the good runs from 120b, if that'd make a big difference. Both models with the reasoning_effort set to high are actually quite good compared to other downloadable models, although the 120b is just about out of reach for 64GB so getting the 20b better for specific use cases seems like it'd be useful.

    • Are you running 120B agentic? I tried using it in a few different setups and it failed hard in every one. It would just give up after a second or two every time.

      I wonder if it has to do with the message format, since it should be able to do tool use afaict.

    • You are describing distillation, there are better ways to do it, and it was done in the past, Deepseek distilled onto Qwen.

    • I’ve a 128GB m3 max MacBook Pro. Running the gpt oss model on it via lmstudio once the context gets large enough the fans spin to 100 and it’s unbearable.

      3 replies →

  • I configured Claude Code to use a local model (ollama run glm-4.7-flash) that runs really well on a 32G M2Pro macmini. Maybe my standards are too low, but I was using that combination to clean up the code, make improvements, and add docs and tests to a bunch of old git repo experiment projects.

  • I wonder if the future in ~5 years is almost all local models? High-end computers and GPUs can already do it for decent models, but not sota models. 5 years is enough time to ramp up memory production, consumers to level-up their hardware, and models to optimize down to lower-end hardware while still being really good.

    • Opensource or local models will always heavily lag frontier.

      Who pays for a free model? GPU training isn't free!

      I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.

      People always will want the fastest, best, easiest setup method.

      "Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.

      9 replies →

    • A lot of manufacturers are bailing on consumer lines to focus on enterprise from what I've read. Not great.

    • Even without leveling up hardware, 5 years is a loooong time to squeeze the juice out of lower-end model capability. Although in this specific niche we do seem to be leaning on Qwen a lot.

  • I can't get Codex CLI or Claude Code to use small local models and to use tools. This is because those tools use XML and the small local models have JSON tool use baked into them. No amount of prompting can fix it.

    In a day or two I'll release my answer to this problem. But, I'm curious, have you had a different experience where tool use works in one of these CLIs with a small local model?

  • I have the same experience with local models. I really want to use them, but right now, they're not on par with propietary models on capabilities nor speed (at least if you're using a Mac).

    • Local models on your laptop will never be as powerful as the ones that take up a rack of datacenter equipment. But there is still a surprising amount of overlap if you are willing to understand and accept the limitations.

  • I'm thinking the next step would be to include this as a 'junior dev' and let Opus farm simple stuff out to it. It could be local, but also if it's on cerebras, it could be realllly fast.

  • Unfortunately Qwen3-next is not well supported on Apple silicon, it seems the Qwen team doesn't really care about Apple.

    On M1 64GB Q4KM on llama.cpp gives only 20Tok/s while on MLX it is more than twice as fast. However, MLX has problems with kv cache consistency and especially with branching. So while in theory it is twice as fast as llama.cpp it often does the PP all over again which completely trashes performance especially with agentic coding.

    So the agony is to decide whether to endure half the possible speed but getting much better kv-caching in return. Or to have twice the speed but then often you have again to sit through prompt processing.

    But who knows, maybe Qwen gives them a hand? (hint,hint)

    • I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try ? What benefit do you get from KV Caching ?

      1 reply →

    • Any notes on the problems with MLX caching? I’ve experimented with local models on my MacBook and there’s usually a good speedup from MLX, but I wasn’t aware there’s an issue with prompt caching. Is it from MLX itself or LMstudio/mlx-lm/etc?

      1 reply →

  • It works reasonably well for general tasks, so we're definitely getting there! Probably Qwen3 CLI might be better suited, but haven't tested it yet.

  • you do realize claude opus/gpt5 are probably like 1000B-2000B models? So trying to have a model that's < 60B offer the same level of performance will be a miracle...

    • I don't buy this. I've long wondered if the larger models, while exhibiting more useful knowledge, are not more wasteful as we greedily explore the frontier of "bigger is getting us better results, make it bigger". Qwen3-Coder-Next seems to be a point for that thought: we need to spend some time exploring what smaller models are capable of.

      Perhaps I'm grossly wrong -- I guess time will tell.

      6 replies →

    • There is (must be - information theory) a size/capacity efficiency frontier. There is no particular reason to think we're anywhere near it right now.

For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next

  • Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

    System info:

        $ ./llama-server --version
        ggml_vulkan: Found 1 Vulkan devices:
        ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
        version: 7897 (3dd95914d)
        built with GNU 11.4.0 for Linux x86_64
    

    llama.cpp command-line:

        $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
        --ctx-size 32768

    • What am I missing here? I thought this model needs 46GB of unified memory for 4-bit quant. Radeon RX 7900 XTX has 24GB of memory right? Hoping to get some insight, thanks in advance!

      1 reply →

  • Hi Daniel, I've been using some of your models on my Framework Desktop at home. Thanks for all that you do.

    Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?

  • Good results with your Q8_0 version on 96GB RTX 6000 Blackwell. It one-shotted the Flappy Bird game and also wrote a good Wordle clone in four shots, all at over 60 tps. Thanks!

    Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page?

17t/s on a laptop with 6GB VRAM and DDR5 system memory. Maximum of 100k context window (then it saturates VRAM). Quite amazing, but tbh I'll still use inference providers, because it's too slow and it's my only machine with "good" specs :)

    cat docker-compose.yml
    services:
      llamacpp:
        volumes:
          - llamacpp:/root
        container_name: llamacpp
        restart: unless-stopped
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        network_mode: host
        command: |
          -hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_XL --jinja --cpu-moe --n-gpu-layers 999 --ctx-size 102400 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on
    # unsloth/gpt-oss-120b-GGUF:Q2_K
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]

    volumes:
       llamacpp:

I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:

  brew upgrade llama.cpp # or brew install if you don't have it yet

Then:

  llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja

That opened a CLI interface. For a web UI on port 8080 along with an OpenAI chat completions compatible endpoint do this:

  llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja

It's using about 28GB of RAM.

It’s hard to elaborate just how wild this model might be if it performs as claimed. The claims are this can perform close to Sonnet 4.5 for assisted coding (SWE bench) while using only 3B active parameters. This is obscenely small for the claimed performance.

  • I experimented with the Q2 and Q4 quants. First impression is that it's amazing we can run this locally, but it's definitely not at Sonnet 4.5 level at all.

    Even for my usual toy coding problems it would get simple things wrong and require some poking to get to it.

    A few times it got stuck in thinking loops and I had to cancel prompts.

    This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.

    • Wonder where it falls on the Sonnet 3.7/4.0/4.5 continuum.

      3.7 was not all that great. 4 was decent for specific things, especially self contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.

      If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.

  • If it sounds too good to be true…

    • There have been advances recently (last year) in scaling deep rl by a significant amount, their announcement is in line with a timeline of running enough experiments to figure out how to leverage that in post training.

      Importantly, this isn’t just throwing more data at the problem in an unstructured way, afaik companies are getting as many got histories as they can and doing something along the lines of, get an llm to checkpoint pull requests, features etc and convert those into plausible input prompts, then run deep rl with something which passes the acceptance criteria / tests as the reward signal.

    • Should be possible with optimised models, just drop all "generic" stuff and focus on coding performance.

      There's no reason for a coding model to contain all of ao3 and wikipedia =)

      9 replies →

    • It literally always is. HN Thought DeepSeek and every version of Kimi would finally dethrone the bigger models from Anthropic, OpenAI, and Google. They're literally always wrong and average knowledge of LLMs here is shockingly low.

      1 reply →

3B active parameters, and slightly worse than GLM 4.7. On benchmarks. That's pretty amazing! With better orchestration tools being deployed, I've been wondering if faster, dumber coding agents paired with wise orchestrators might be overall faster than using the say opus 4.5 on the bottom for coding. At least we might want to deploy to these guys for simple tasks.

  • It's getting a lot easier to do this using sub-agents with tools in Claude. I have a fleet of Mastra agents (TypeScript). I use those agents inside my project as CLI tools to do repetitive tasks that gobble tokens such as scanning code, web search, library search, and even SourceGraph traversal.

    Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.

  • Time will tell. All this stuff will get more adoption when Anthropic, Google and OpenAI raise prices.

    • They can only raise prices as long as people buy their subscriptions / pay for their api. The Chinese labs are closing in on the SOTA models (I would say they are already there) and offer insane cheap prices for their subscriptions. Vote with your wallet.

Benchmarks using DGX Spark on vLLM 0.15.1.dev0+gf17644344

  FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8

  Sequential (single request)

    Prompt     Gen     Prompt Processing    Token Gen
    Tokens     Tokens  (tokens/sec)         (tokens/sec)
    ------     ------  -----------------    -----------
       521        49            3,157            44.2
     1,033        83            3,917            43.7
     2,057        77            3,937            43.6
     4,105        77            4,453            43.2
     8,201        77            4,710            42.2

  Parallel (concurrent requests)

    pp4096+tg128 (4K context, 128 gen):

     n    t/s
    --    ----
     1    28.5
     2    39.0
     4    50.4
     8    57.5
    16    61.4
    32    62.0

    pp8192+tg128 (8K context, 128 gen):

     n    t/s
    --    ----
     1    21.6
     2    27.1
     4    31.9
     8    32.7
    16    33.7
    32    31.7

Using lmstudio-community/Qwen3-Coder-Next-GGUF:Q8_0 I'm getting up to 32 tokens/s on Strix Halo, with room for 128k of context (out of 256k that the model can manage).

From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.

  • How's the Strix Halo? I'd really like to get a local inference machine so that I don't have to use quantized versions of local models.

    • Works great for these type of MOE models. The ability to have large amounts of VRAM let you run different models in parallel easily, or to have actually useful context sizes. Dense models can get sluggish though. AMD's ROCm support has been a little rough for Stable Diffusion stuff (memory issues leading to application stability problems) but it's worked well with LLMs, as does Vulkan.

      I wish AMD would get around to adding NPU support in Linux for it though, it has more potential that could be unlocked.

  • I'm getting similar numbers on NVIDIA Spark around 25-30 tokens/sec output, 251 token/sec prompt processing... but I'm running with the Q4_K_XL quant. I'll try the Q8 next, but that would leave less room for context.

    I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.

    I suspect there will be some optimizations over the next few weeks that will pick up the performance on these type of machines.

    I have it writing some Rust code and it's definitely slower than using a hosted model but it's actually seeming pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.

    I suspect the API providers will offer this model for nice and cheap, too.

    • llama.cpp is giving me ~35tok/sec with the unsloth quants (UD-Q4_K_XL, elsewhere in this thread) on my Spark. FWIW my understanding and experience is that llama.cpp seems to give slight better performance for "single user" workloads, but I'm not sure why.

      I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...

      1 reply →

What is the best place to see local model rankings? The benchmarks seem so heavily gamed that I am willing to believe the “objective” rankings are a lie and personal reviews are more meaningful.

Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.

Not crazy about it. It keeps getting stuck in a loop and filling up the context window (131k, run locally). Kimi's been nice, even if a bit slow.

As always, the Qwen team is pushing out fantastic content

Hope they update the model page soon https://chat.qwen.ai/settings/model

  • That’s a perfectly fine usage of content (primary substance offered by a “website”)

  • > "content"

    Sorry, but we're talking about models as content now? There's almost always a better word than "content" if you're describing something that's in tech or online.

    • I wasn’t only referring to their new model, I meant their blogpost and the research behind their progress, its always a joyride to read.

      I didn’t know it was this serious with the vocabulary, I’ll be more cautious in the future.

I kind of lost interest in local models. Then Anthropic started saying I’m not allowed to use my Claude Code subscription with my preferred tools and it reminded me why we need to support open tools and models. I’ve cancelled my CC subscription, I’m not paying to support anticompetitive behaviour.

  • > Then Anthropic started saying I’m not allowed to use my Claude Code subscription with my preferred tools

    To be clear, since this confuses a lot of people in every thread: Anthropic will let you use their API with any coding tools you want. You just have to go through the public API and pay the same rate as everyone else. They have not "blocked" or "banned" any coding tools from using their API, even though a lot of the clickbait headlines have tried to insinuate as much.

    Anthropic never sold subscription plans as being usable with anything other than their own tools. They were specifically offered as a way to use their own apps for a flat monthly fee.

    They obviously set the limits and pricing according to typical use patterns of these tools, because the typical users aren't maxing out their credits in every usage window.

    Some of the open source tools reverse engineered the protocol (which wasn't hard) and people started using the plans with other tools. This situation went on for a while without enforcement until it got too big to ignore, and they began protecting the private endpoints explicitly.

    The subscription plans were never sold as a way to use the API with other programs, but I think they let it slide for a while because it was only a small number of people doing it. Once the tools started getting more popular they started closing loopholes to use the private API with other tools, which shouldn't really come as a surprise.

    • > Anthropic will let you use their API with any coding tools you want

      No, in 2026, even with their API plan the create key is disabled for most orgs, you basically have to ask your admin to give you a key to use something other than Claude Code. You can imagine how that would be a problem.

    • The anticompetitive part is setting a much lower price for typical usage of Claude Code vs. typical usage of another CLI dev tool.

      4 replies →

    • from what i remember, i couldnt actually use claude code with the subscription when i subscribed. i could only use it with third party tools.

      eventually they added subscription support and that worked better than cline or kilo, but im still not clear what anthropic tools the subscription was actually useful for

    • The question I pose is this: if they're willing to start building walls this early in the game while they've still got plenty of viable competitors, and are at most 6 months ahead, how will they treat us if they achieve market dominance?

      Some people think LLMs are the final frontier. If we just give in and let Anthropic dictate the terms to us we're going to experience unprecedented enshittification. The software freedom fight is more important than ever. My machine is sovereign; Anthropic provides the API, everything I do on my machine is my concern.

    • I don't get why so much mental gymnastics is done to avoid the fact that locking their lower prices to effectively subsidize their shitty product is the anti competitive behavior.

      They simply don't want to compete, they want to force the majority of people that can't spend a lot on tokens to use their inferior product.

      Why build a better product if you control the cost?

  • You gave up some convenience to avoid voting for a bad practice with your wallet. I admire this, try to consistently do this when reasonably feasible.

    Problem is, most people don't do this, choosing convenience at any given moment without thinking about longer-term impact. This hurts us collectively by letting governments/companies, etc tighten their grip over time. This comes from my lived experience.

    • Society is lacking people that stand up for something. My efforts to consume less is seen as being cheap by my family, which I find so sad. I much prefer donating my money than exchanging superfluous gifts on Christmas.

    • As I get older I more and more view convenience as the enemy of good. Luckily (or unluckily for some) a lot of the tradeoffs we are asked to make in the name of convenience are increasingly absurd. I have an easier and easier time going without these Faustian bargains.

      1 reply →

  • Claude Opus 4.5 by far is the most capable development model. I've been using it mainly via Claude Code, and with Cursor.

    I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.

    Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.

    • I'd encourage you to try the -codex family with the highest reasoning.

      I can't comment on Opus in CC because I've never bit the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).

      I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.

      I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".

      4 replies →

    • > Claude Opus 4.5 by far is the most capable development model.

      At the moment I have a personal Claude Max subscription and ChatGPT Enterprise for Codex at work. Using both, I feel pretty definitively that gpt-5.2-codex is strictly superior to Opus 4.5. When I use Opus 4.5 I’m still constantly dealing with it cutting corners, misinterpreting my intentions and stopping when it isn’t actually done. When I switched to Codex for work a few months ago all of those problems went away.

      I got the personal subscription this month to try out Gas Town and see how Opus 4.5 does on various tasks, and there are definitely features of CC that I miss with Codex CLI (I can’t believe they still don’t have hooks), but I’ve cancelled the subscription and won’t renew it at the end of this month unless they drop a model that really brings them up to where gpt-5.2-codex is at.

      2 replies →

    • It feels very close to a trade-off point.

      I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.

      What I find most frustrating is that I am not sure if it is even actual model quality that is the blocker with other models. Gemini just goes off the rails sometimes with strange bugs like writing random text continuously and burning output tokens, Grok seems to have system prompts that result in odd behaviour...no bugs just doing weird things, Gemini Flash models seem to output massive quantities of text for no reason...it is often feels like very stupid things.

      Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?

      I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.

      The day for open models will come but it still feels so close and so far.

  • I do wonder if they locked things down due to people abusing their CC token.

    • I buy the theory that Claude Code is engineered to use things like token caching efficiently, and their Claude Max plans were designed with those optimizations in mind.

      If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations the economics may no longer have worked out.

      (But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)

      7 replies →

    • In what way would it be abused? The usage limits apply all the same, they aren't client side, and hitting that limit is within the terms of the agreement with Anthropic.

      11 replies →

    • Nah, their "moat" is CC, they are afraid that as other folks build effective coding agent, they are are going lose market share.

    • Taking umbrage as if it matters how I use the compute I'm paying for via the harness they want me to use it within as long as I'm just doing personal tasks I want to do for myself, not trying to power an apps API with it seems such a waste of their time to be focusing on and only causes brand perception damage with their customers.

      Could have just turned a blind eye.

    • How do I "abuse" a token? I pass it to their API, the request executes, a response is returned, I get billed for it. That should be the end of the conversation.

      (Edit due to rate-limiting: I see, thanks -- I wasn't aware there was more than one token type.)

      2 replies →

    • The loss of access shows the kind of power they'll have in the future. It's just a taste of what's to come.

      If a company is going to automate our jobs, we shouldn't be giving them money and data to do so. They're using us to put ourselves out of work, and they're not giving us the keys.

      I'm fine with non-local, open weights models. Not everything has to run on a local GPU, but it has to be something we can own.

      I'd like a large, non-local Qwen3-Coder that I can launch in a RunPod or similar instance. I think on-demand non-local cloud compute can serve as a middle ground.

  • Access is one of my concerns with coding agents - on the one hand I think they make coding much more accessible to people who aren't developers - on the other hand this access is managed by commercial entities and can be suspended for any reason.

    I can also imagine a dysfunctional future where a developers spend half their time convincing their AI agents that the software they're writing is actually aligned with the model's set of values

  • Easy to use a local proxy to use other models with CC. Wrote a basic working one using Claude. LiteLLM is also good. But I agree, fuck their mindset

  • Anthropic banned my account when I whipped up a solution to control Claude Code running on my Mac from my phone when I'm out and about. No commercial angle, just a tool I made for myself since they wouldn't ship this feature (and still haven't). I wasn't their biggest fanboy to begin with, but it gave me the kick in the butt needed to go and explore alternatives until local models get good enough that I don't need to use hosted models altogether.

    • I control it with ssh and sometimes tmux (but termux+wireguard lead to a surprisingly generally stable connection). Why did you need more than that?

      3 replies →

    • How did this work? The ban, I mean. Did you just wake up to find out an email and that your creds no longer worked? Were you doing things to sub-process out to the Claude Code CLI or something else?

      4 replies →

    • There is weaponized malaise employed by these frontier model providers and I feel like that dark-pattern, what you pointed out, and others are employed to rate-limit certain subscriptions.

      9 replies →

  • im downloading it as we speek to try to run it on a 32gb 5090 + 128gb ddr5 i will compare it to glm 4.7-flash that was my local model of choice

  • Did they actually say that? I thought they rolled it back.

    OpenCode et al continue to work with my Max subscription.

  • OpenAI committed to allowing it btw. I don't know why Anthropic gets so much love here

    • Cause they make the best coding model.

      It's that simple. Everyone else is trying to compete in other ways and Anthropic are pushing for dominate the market.

      They'll eventually lose their performance edge and suddenly they will back to being cute and fluffy

      I've cancelled a clause sub, but still have one.

      1 reply →

    • Probably because the alternatives are OpenAI, Google, Meta. Not throwing shade at those companies but it's not hard to win the hearts of developers when that's your competition.

    • Thanks, I’ll try out Codex to bridge until local models get to the level I need.

    • On the other hand I feel like 5.2 gets progressively dumbed down. It used to work well, but now initial few prompts go in right direction and then it goes off the rails reminding me more of a GPT-3.5.

      I wonder what they are up to.

  • What do you require local models to do? The State of Utopia[1] is currently busy porting a small model to run in a zero-trust environment - your web browser. It's finished the port in javascript and is going to wasm now for the CPU path. you can see it being livecoded by Claude right now[2] (this is day 2, day 1 it ported the C++ code to javascript successfully). We are curious to know what permissions you would like to grant such a model and how you would like it served to you. (For example, we consider that you wouldn't trust a Go build - especially if it's built by a nation state, regardless of our branding, practices, members or contributors.)

    Please list what capabilities you would like our local model to have and how you would like to have it served to you.

    [1] a sovereign digital nation built on a national framework rather than a for-profit or even non-profit framework, will be available at https://stateofutopia.com (you can see some of my recent posts or comments here on HN.)

    [2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi

  • > I’m not paying to support anticompetitive behaviour

    You are doing that all the time. You just draw the line, arbitrarily.

    • That's great, yes. We all draw the line somewhere, subjectively. We all pretend we follow logic and reason and lets all be more honest and truthfully share how we as humans are emotionally driven not logically driven.

      It's like this old adage "Our brains are poor masters and great slaves". We are basically just wanting to survive and we've trained ourselves to follow the orders of our old corporate slave masters who are now failing us, and we are unfortunately out of fear paying and supporting anticompetitive behavior and our internal dissonance is stopping us from changing it (along with fear of survival and missing out and so forth).

      The global marketing by the slave master class isn't helping. We can draw a line however arbitrary we'd like though and its still better and more helpful than complaining "you drew a line arbitrarily" and not actually doing any of the hard courageous work of drawing lines of any kind in the first place.

I just tried qwen 3 tts and it was mind blowingly good, you can even provide directions for the overall tone etc. Which wasn't the case when I used commercial super expensive products like the (now closed after being bought by meta) play.ht .

Does anyone see a reason to still use elevenlabs etc. ?

These guys are setting up to absolutely own the global south market for AI. Which is in line with the belt and road initiative.

So dang exciting! There are a bunch of new interesting small models out lately, by the way, this is just one of them...

I really really want local or self hosted models to work. But my experience is they’re not really even close to the closed paid models.

Does anyone any experience with these and is this release actually workable in practice?

  • > But my experience is they’re not really even close to the closed paid models.

    They are usually as good as the flagship model for 12-18 months ago. Which may sound like a massive difference, because somehow it is, but it's also fairly reasonable, you don't need to live to the bleeding edge.

    • And it's worth pointing out that Claude Code now dispatches "subagents" from Opus->Sonnet and Opus->Haiku ... all the time, depending on the problem.

      Running this thing locally on my Spark with 4-bit quant I'm getting 30-35 tokens/sec in opencode but it doesn't feel any "stupider" than Haiku, that's for sure. Haiku can be dumb as a post. This thing is smarter than that.

      It feels somewhere around Sonnet 4 level, and I am finding it genuinely useful at 4-bit even. Though I have paid subscriptions elsewhere, so I doubt I'll actually use it much.

      I could see configuration OpenCode somehow to use paid Kimi 2.5 or Gemini for the planning/analysis & compaction, and this for the task execution. It seems entirely competent.

Pretty cool that they are advertising OpenClaw compatibility. I've tried a few locally-hosted models with OpenClaw and did not get good results – (that tool is a context-monster... the models would get completely overwhelmed them with erroneous / old instructions.)

Granted these 80B models are probably optimized for H100/H200 which I do not have. Here's to hoping that OpenClaw compat. survives quantization

For someone who is very out of the loop with these AI models, can someone explain what I can actually run on my 3080ti (12G)? Is this something like that or is this still too big; is there anything remotely useful runnable with my GPU? I have 64G RAM if that helps (?).

  • This model does not fit in 12G of VRAM - even the smallest quant is unlikely to fit. However, portions can be offloaded to regular RAM / CPU with a performance hit.

    I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality / speed tradeoff with your hardware that you're willing to accept.

    The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...

    • Thanks for the pointers!

      one more thing, that guide says:

      > You can choose UD-Q4_K_XL or other quantized versions.

      I see eight different 4-bit quants (I assume that is the size I want?).. how to pick which one to use?

          IQ4_XS
          Q4_K_S
          Q4_1
          IQ4_NL
          MXFP4_MOE
          Q4_0
          Q4_K_M
          Q4_K_XL

      1 reply →

  • This model is exactly what you’d want for your resources. GPU for prompt processing, ram for model weights and context length, and it being MoE makes it fairly zippy. Q4 is decent; Q5-6 is even better, assuming you can spare the resources. Going past q6 goes into heavily diminishing resources.

Can anyone help me understand the "Number of Agent Turns" vs "SWE-Bench Pro (%)" figure? I.e. what does the spread of Qwen3-Coder-Next from ~50 to ~280 agent turns represent for a fixed score of 44.3%: that sometimes it takes that spread of agent turns to achieve said fixed score for the given model?

  • SWE-Bench Pro consists of 1865 tasks. https://arxiv.org/abs/2509.16941 Qwen3-Coder-Next solved 44.3% (826 or 827) of these tasks. To solve a single task, it took between ≈50 and ≈280 agent turns, ≈150 on average. In other words, a single pass through the dataset took ≈280000 agent turns. Kimi-K2.5 solved ≈84 fewer tasks, but also only took about a third as many agent turns.

    • Ah, a spread of the individual tests makes plenty of sense! Many thanks (same goes to the other comments).

    • If this is genuinely better than K2.5 even at a third the speed then my openrouter credits are going to go unused.

  • Essentially the more turns you have the more the agent is likely to fail since the error compounds per turn. Agentic model are tuned for “long horizon tasks” ie being able to go many many turns on the same problem without failing.

will this run on an apple m4 air with 32gb ram?

Im currently using qwen 2.5 16b , and it works really well

  • No, at Q2 you are looking at a size of about 26gb-30gb. Q3 exceeds it, you might run it, but the result might vary. Best to run a smaller model like qwen3-32b/30b at Q6

Is this going to need 1x or 2x of those RTX PRO 6000s to allow for a decent KV for an active context length of 64-100k?

It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.

  • I have a 3090 and a 4090 and it all fits in in VRAM with Q4_0 and quantized KV, 96k ctx. 1400 pp, 80 tps.

  • 1 6000 should be fine, Q6_K_XL gguf will be almost on par with the raw weights and should let you have 128k-256k context.

Does Qwen3 allow adjusting context during an LLM call or does the housekeeping need to be done before/after each call but not when a single LLM call with multiple tool calls is in progress?

  • Not applicable... the models just process whatever context you provide to them, context management happens outside of the model and depends on your inference tool/coding agent.

    • It's interesting how people can be so into LLMs but dont, at the end of the day, understand they're just passing "well formatted" text to a text processor and everything else is build around encoding/decoding it into familiar or novel interfaces & the rest.

      The instability of the tooling outside of the LLM is what keeps me from building anything on the cloud, because you're attaching your knowledge and work flow to a tool that can both change dramatically based on context, cache, and model changes and can arbitrarily raise prices as "adaptable whales" push the cost up.

      Its akin to learning everything about beanie babies in the early 1990's and right when you think you understand the value proposition, suddenly they're all worthless.

      1 reply →

how can anyone keep up with all these releases... what's next? Sonnet 5?

  • Tune it out, come back in 6 months, the world is not going to end. In 6 months, you’re going to change your API endpoint and/or your subscription and then spend a day or two adjusting. Off to the races you go.

  • Pretty much every lab you can think of has something scheduled for february. Gonna be a wild one

  • This is going to be a crazy month because the Chinese labs are all trying to get their releases out prior to their holidays (Lunar New Year / Spring Festival).

    So we've seen a series of big ones already -- GLM 4.7 Flash, Kimi 2.5, StepFun 3.5, and now this. Still to come is likely a new DeepSeek model, which could be exciting.

    And then I expect the Big3, OpenAI/Google/Anthropic will try to clog the airspace at the same time, to get in front of the potential competition.

  • Relatively, it's not that hard. There's like 4-5 "real" AI labs, who altogether manage to announce maybe 3 products max, per-month.

    Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.

Looks great - i'll try to check it out on my gaming PC.

On a misc note: What's being used to create the screen recordings? It looks so smooth!

the qwen website doesn't work for me in safari :(. had to read the announcement in chrome

Is there any online resource tracking local model capability on say... a $2000 64gb memory Mac Mini? I'm getting increasingly excited about the local model space because it offers us a future where we can benefit from LLMs without having to listen to tech CEOs saber rattle about removing America of its jobs so they can get the next fundraising round sorted

We are getting there, as a next step please release something to outperform Opus 4.5 and GPT 5.2 in coding tasks

  • By the time that happens, Opus 5 and GPT-5.5 will be out. At that point will a GPT-5.2 tier open-weights model feel "good enough"? Based on my experience with frontier models, once you get a taste of the latest and greatest it's very hard to go back to a less capable model, even if that less capable model would have been SOTA 9 months ago.

    • I think it depends on what you use it for. Coding, where time is money? You probably want the Good Shit, but also want decent open weights models to keep prices sane rather than sama’s 20k/month nonsense. Something like a basic sentiment analysis? You can get good results out of a 30b MoE that runs at good pace on a midrange laptop. Researching things online with many sources and decent results I’d expect to be doable locally by the end of 2026 if you have 128GB ram, although it’ll take a while to resolve.

      2 replies →

    • When Alibaba succeeds at producing a GPT-5.2-equivalent model, they won't be releasing the weights. They'll only offer API access, like for the previous models in the Qwen Max series.

      Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.

    • If an open weights model is released that’s as capable at coding as Opus 4.5, then there’s very little reason not to offload the actual writing of code to open weight subagents running locally and stick strictly to planning with Opus 5. Could get you masses more usage out of your plan (or cut down on API costs).

    • I'm going in the opposite direction: with each new model, the more I try to optimize my existing workflows by breaking the tasks down so that I can delegate tasks to the less powerful models and only rely on the newer ones if the results are not acceptable.

    • I used to say that Sonnet 4.5 was all I would ever need, but now I exclusively use Opus...

    • > Based on my experience with frontier models, once you get a taste of the latest and greatest it's very hard to go back to a less capable model, even if that less capable model would have been SOTA 9 months ago.

      That's the tyranny of comfort. Same for high end car, living in a big place, etc.

      There's a good work around though: just don't try the luxury in the first place so you can stay happy with the 9 months delay.

  • I'd be happy with something that's close or same as opus 4.5 that I can run locally, at reasonable (same) speed as claude cli, and at reasonable budget (within $10-30k).

My IT department is convinced these "ChInEsE cCcP mOdElS" are going to exfiltrate our entire corporate network of its essential fluids and vita.. erh, I mean data. I've tried explaining to them that it's physically impossible for model weights to make network requests on their own. Also, what happened to their MitM-style, extremely intrusive network monitoring that they insisted we absolutely needed?

I wonder if we could have much smaller models if they train on less languages? i.e. python + yaml + json only or even an single languages with an cluster of models loaded into memory dynamically...?

The agent orchestration point from vessenes is interesting - using faster, smaller models for routine tasks while reserving frontier models for complex reasoning.

In practice, I've found the economics work like this:

1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability 2. Architecture decisions, debugging subtle issues - worth the cost of frontier models 3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more

The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.

  • I find it really surprising that you’re fine with low end models for coding - I went through a lot of open-weights models, local and "local", and I consistently found the results underwhelming. The glm-4.7 was the smallest model I found to be somewhat reliable, but that’s a sizable 350b and stretches the definition of local-as-in-at-home.