Qwen3-Next

18 hours ago (qwen.ai)

Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.

Deepseek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...

But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are each [129280, 7168], or about 2GB combined at FP8.

Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...

So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that helps significantly speed up inference.
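
Rough back-of-envelope check of the sizes above (my own arithmetic, not a figure from the model cards), as a quick Python sketch:

    vocab, hidden = 129_280, 7_168
    params_per_tensor = vocab * hidden                     # ~0.93B parameters each
    bytes_per_param_fp8 = 1                                # FP8 = 1 byte per parameter
    extra = 2 * params_per_tensor * bytes_per_param_fp8    # embed_tokens + shared_head.head
    print(extra / 1e9)                                     # ~1.85 GB of extra weights for MTP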

  • What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

    • Speculative decoding! It makes inference a LOT faster.

      Instead of generating tokens one at a time, you also generate a draft of the second token, and then use speculative decoding on that second token (instead of having it be produced by a separate draft model like Qwen 0.6b). If the draft checks out as correct, the 2nd token gets generated MUCH faster.

      If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
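
      A minimal sketch of that accept/reject loop, assuming a hypothetical model object with a main head and an MTP head (the method names are made up for illustration, not vLLM's or Qwen's actual API):

          def decode_step(model, ctx):
              nxt, draft = model.predict_with_mtp(ctx)  # main head: next token; MTP head: guess for the one after
              ctx.append(nxt)
              verified = model.predict(ctx)             # one ordinary forward pass to check the draft
              ctx.append(draft if verified == draft else verified)
              return ctx                                # accepted draft = 2 tokens for ~1 extra pass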

    • > What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

      It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

    • It could be a better draft model than separately trained EAGLE etc for speculative decoding.

  • How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding - meaning if I use this model in vLLM, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding?

  • Could someone kindly point to a convenient all-on-one ELI5 of all these words? :')

    • Unfortunately, no. The industry is moving super quickly and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best approach is to take some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/ChatGPT/Gemini, and then, as you read something like this, search up any acronym you don't know (in this case: MTP = Multi-Token Prediction) and see if you have the basis for understanding it. If not, read up on the precursors.

      Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available don't describe what's going on well enough to reason about and understand. So, you just have to start reading!

    • Background:

      LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to a 1D list at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary, i.e. f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where the bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard).

      ---

      When training, you make an LLM learn that

      I use arch = downscale(upscale(I use))

      If you want to predict the next word after that, you then do the following in sequence:

      I use arch btw = downscale(upscale(I use arch))

      Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.

      i.e in parallel:

      I use arch = downscale1(upscale(I use))

      I use ____ btw = downscale2(upscale(I use))

      However, this way you'll need twice the number of parameters that downscale needs. And if you want to predict more tokens ahead, you'll need even more parameters.

      What Qwen has done is, instead of making downscale1 and downscale2 completely separately parameterized functions, set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common, and that the difference between predicting the next and the next-to-next token can be captured in one lightweight function each. Lightweight here means fewer parameters. The bet paid off.

      So overall, you save params.

      Concretely,

      Before: downscale1.params + downscale2.params

      After: downscale_common.params + lightweight1.params + lightweight2.params

      Edit: it's actually downscale_common(lightweight()) and not the other way around as I have written above. It doesn't change the crux of the answer, but I'm including this for clarity.
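
      A toy PyTorch sketch of that factoring, using the corrected composition from the edit above (sizes and names are made up for illustration; this is not Qwen's actual module layout):

          import torch.nn as nn

          hidden, vocab = 1024, 32000

          upscale          = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())  # stand-in for the big shared trunk
          downscale_common = nn.Linear(hidden, vocab)    # shared un-embedding, paid for once
          lightweight1     = nn.Linear(hidden, hidden)   # small head for the next token
          lightweight2     = nn.Linear(hidden, hidden)   # small head for the token after that

          def predict_two(x):                            # x: [batch, hidden]
              h = upscale(x)
              logits_1 = downscale_common(lightweight1(h))   # next-token scores
              logits_2 = downscale_common(lightweight2(h))   # next-next-token scores (MTP head)
              return logits_1, logits_2

          # Param saving: one [hidden, vocab] matrix instead of two, plus two small [hidden, hidden] heads.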

    • For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask to explain it on my level and then I can ask questions for clarification.

    • The following was generated by chatG5:

          Qwen3-Next — A family of large language models from Qwen (Alibaba).  
          DeepSeek R1 — Another large open-source language model from DeepSeek AI.  
          Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.  
          MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.  
          Embedding — Converts words/tokens into vectors (numbers) the model can work with.  
          Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.  
          embed_tokens — The big lookup table of embeddings (token → vector).  
          shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.  
          [129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).  
          FP8 — Floating-point format using 8 bits (compact, faster, less precise).  
          Active parameters — The weights that actually need to be loaded in GPU memory to run the model.  
          Inference — Running the model to generate text (as opposed to training it).  
          GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.

Alibaba keeps releasing gold content

I just tried Qwen3-Next-80B-A3B on Qwen Chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis.

According to Qwen Chat, Qwen3-Next has the following limits:

Maximum context length: 262,144 tokens

Max summary generation length: 32,768 tokens

This is 2x higher on context length and 4x higher on summary generation compared to Qwen3-235B-A22B, damn

> Qwen3-Next [...] excels in ultra-long-context understanding and complex tasks

Even though their new hybrid architecture is fascinating, I think I'll continue to stick with Qwen2.5-Turbo because it's one of the few models that supports 1M tokens of context length. My use case is uploading large PDFs and asking questions across chapters.

  • My take on long context for many frontier models: it's not about support, but that accuracy drops drastically as you increase the context. Even if a model claims to support 10M context, the reality is it doesn't perform well when you saturate it. Curious to hear others' perspectives on this.

    • Agreed. That said, in general a 1M context model has a larger usable window than a 260k context model.

    • This is my experience with Gemini. Yes, I really can put an entire codebase and all the docs and pre-dev discussions and all the inter-engineer chat logs in there.

      I still see the model becoming more intoxicated as turn count gets high.

  • If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

    > Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

    Source: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct#proc...
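
    For reference, a rough sketch of what loading with a YaRN override can look like via transformers (the rope_scaling values below are illustrative assumptions; check the linked model card for the exact settings Qwen recommends for this model):

        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen3-Next-80B-A3B-Instruct",
            rope_scaling={
                "rope_type": "yarn",
                "factor": 4.0,                               # 262,144 x 4 ~= 1M tokens
                "original_max_position_embeddings": 262144,  # native context length
            },
        )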

    • > If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

      I read the article, but as I said Qwen chat only provides up to 262k tokens in context length, so I'll stick with Qwen2.5 Turbo which supports 1M tokens.

      I am not in a position where I can self-host yet

  • Their proprietary models are very good too and go under the radar; they never seem to appear on any benchmarks. Qwen3-coder-plus is significantly better than their open source Qwen3, and Qwen3-Max also rivals the SOTA models.

  • How are you prepping the PDF data before shoving it into Qwen?

    • I just compress the file size as low as possible without losing quality; I didn't even know there were more ways to prep it.

      I do sometimes chop up the PDF into smaller PDFs, each with its own individual chapter.

    • Not OP, but we use the docling library to extract text and put it in markdown before storing for use with an LLM.

llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"

Here's a classic ASCII art representation of SpongeBob SquarePants:

     .------.
    /  o o  \
   |         |
   |  \___/  |
    \_______/


  llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  "An ASCII of spongebob"

  Here's an ASCII art of SpongeBob SquarePants:

      .--..--..--..--..--..--.
    .' \  (`._   (_)     _   \
  .'    |  '._)         (_)  |
  \ _.')\      .----..--.'  /
  |(_.'  |    /    .-\-.  \---.
  \     0|    |   ( O| O) |    |
   |  _  |  .--.____.'._.-.  |
   \ (_) | o         -` .-`  |
    |    \   |`-._ _ _ _ _\ /
    \    |   |  `. |_||_|   |
     | o |    \_       \    |     -.   .-.
     |.-.  \    `--..-'   O |     `.`-' .'
   _.'  .' |     `-.-'      /-.__   ' .-'
  .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
  `-._  `.  |________/\_____|    `-.'
     .'   ).| '=' '='\/ '=' |
     `._.`  '---------------'
             //___\   //___\
               ||       ||
               ||_.-.   ||_.-.
              (_.--__) (_.--__)


Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end - leaving the legs off. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?

Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.

  • memorized: https://www.asciiart.eu/cartoons/spongebob-squarepants

    • Naturally. That's how LLMs work. During training you measure the loss, the difference between the model output and the ground-truth and try to minimize it. We prize models for their ability to learn. Here we can see that the large model does a great job at learning to draw bob, while the small model performs poorly.
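
      A toy illustration of the "measure the loss and minimize it" step (made-up numbers, not a real model):

          import math

          p_true_token = 0.20                  # model's current probability for the ground-truth next character
          print(-math.log(p_true_token))       # ~1.61 -- training nudges weights so this shrinks
          p_after_training = 0.95
          print(-math.log(p_after_training))   # ~0.05 -- the ASCII sequence is now effectively memorized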

    • For the model to have memorized the entire sequence of characters precisely, this must appear hundreds of times in the training data?

  • I think there is some distillation relationship between Kimi K2 and Qwen Coder or other related models, or the same training data. I tried most LLMs; only Kimi K2 gave the exact same ASCII. Kimi K2: Here's a classic ASCII art of SpongeBob SquarePants for you:

               .--..--..--..--..--..--.
            .' \  (`._   (_)     _   \
          .'    |  '._)         (_)  |
          \ _.')\      .----..---.   /
          |(_.'  |    /    .-\-.  \  |
          \     0|    |   ( O| O) | o|
           |  _  |  .--.____.'._.-.  |
           \ (_) | o         -` .-`  |
            |    \   |`-._ _ _ _ _\ /
            \    |   |  `. |_||_|   |
            | o  |    \_      \     |     -.   .-.
            |.-.  \     `--..-'   O |     `.`-' .'
          _.'  .' |     `-.-'      /-.__   ' .-'
        .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
        `-._  `.  |________/\_____|    `-.'
           .'   ).| '=' '='\/ '=' |
           `._.`  '---------------'
                   //___\   //___\
                     ||       ||
                     ||_.-.   ||_.-.
                    (_.--__) (_.--__)

    Enjoy your SpongeBob ASCII!

    • For ASCII art to look right and not get messed up, the generator has to know the width of the div in characters, e.g. 80, 240, etc., so it can make sure the lines don't wrap. So how does an LLM know anything about the UI it's serving? Is it just luck? What if you ask it to draw something that's like 16:9 in aspect ratio... would it know to scale it down so lines won't wrap? How about loss of detail if it does? Also, is it as good with Unicode art? So many questions.

  • I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.

    With Qwen3's sparse MoE, though, the path to that memory is noisier: two extra stochastic draws -- (a) which expert(s) fire, and (b) which token gets sampled from them. Add the new gated attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.

    Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models:

    Model: chutes/Qwen/Qwen3-235B-A22B

       /~\       
      (  *  *  )  
      (  o o o  )  
       \  -  /     
        \  /\  /   
         \  /      
          \/       
         /|||\      
        /|||||\     
       /||||||||\   
      ( o   o   o ) 
       \   W   /   
        \___/
        

    Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507

        /\_/\  
       ( o.o ) 
        > ^ <  
       /     \ 
      |       |
      |       |
       \     /
        '-'-'
    

    Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507

      .-----------.
      | []     [] |
      |           |
      |   __  __  |
      |  |  ||  | |
      |  |__||__| |
      |           |
      '-----------'
    
    

    Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct

            __
           /  \
          /    \
         /      \
        /        \
       /__________\
      |  o      o  |
      |     __     |
      |  \____/    |
      |            |
      |  ________  |
      |  \      /  |
      |   \____/   |
      |            |
      |____________|
      
         ___________
        /           \
       /             \
      |   _______     |
      |  |       |    |
      |  |  ___  |    |
      |  | |   | |    |
      |  | |___| |    |
      |  |_______|    |
      |               |
      |_______________|
    
    

    Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking

      .-.
       /   \
      |  o o|
      |  >  |
      |  ---|
       \___/
      
    

    Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507

        _________________________
       /                         \
      |   _     _     _     _   |
      |  / \   / \   / \   / \  |
      | |   | |   | |   | |   | |
      |  \_/   \_/   \_/   \_/  |
      |                         |
      |   _     _     _     _   |
      |  / \   / \   / \   / \  |
      | |   | |   | |   | |   | |
      |  \_/   \_/   \_/   \_/  |
      |                         |
      |    SpongeBob SquarePants   |
      |_________________________|

The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model, depending on how you offload between VRAM and CPU. That’s insane.

  • Qwen isn't directing the forward progress of LLMs. SOTA LLMs have been MoE since GPT-4. The OG 4.

    Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.

    • I would venture to suggest that reading it as "Qwen made MoEs in toto || first || better than anyone else" is reductive - merely, the number of experts and the ratios here are quite novel (80B... inferencing only 3B!?!). I sometimes kick around the same take, but thought I'd stand up for this. And I know what I'm talking about; I maintain a client that wraps llama.cpp x ~20 models on inference APIs.

  • In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms compared to models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.

Added Qwen3 Next to the Brokk Power Ranking Open Round (coding benchmark). It's roughly GPT-OSS-20b strength.

Full set of open weight model results: https://brokk.ai/power-ranking?version=openround&models=ds-r...

  • This would be a valuable benchmark if it included languages other than Java, and let me see which models are best at the languages I work with.

    My real-world usage does not line up with these results, but I'm not working with Java.

The same week, Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true, then this could lead to a lot less demand for Nvidia, Oracle, Coreweave, etc.

  • https://en.wikipedia.org/wiki/Jevons_paradox

    • Sure but where is the demand going to come from? LLMs are already in every google search, in Whatsapp/Messenger, throughout Google workspace, Notion, Slack, etc. ChatGPT already has a billion users.

      Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.

  • I'm not going to speculate about what might be ahead in regards to Oracle's forecasting of data center demand, but regarding the idea of efficiency gains leading to lower demand, don't you think something like Jevons paradox might apply here?

  • People said the same thing for deepseek-r1, and nothing changed.

    If you come up with a way to make the current generation of models 10x more efficient, then everyone just moves to train a 10x bigger model. There isn't a model size the players are going to be satisfied with and not go 10x bigger. Not as long as scaling still pays off (and it does today).

  • Absolutely not; the trends have proven that people will just pay for the best quality they can get, and keep paying roughly the same money.

    Every time a new model is released, people abandon the old, lower quality model (even when it’s priced less), and instead prefer to pay the same for a better model.

    The same will happen with this.

    • Sure but the money people are paying right now isn't that much in the grand scheme of things. OpenAI is expecting 13bn in revenue this year. AWS made over 100bn last year. So unless they pay a lot more, or they find customers outside of programmers, designers, etc who are willing to pay for the best quality, I don't see how it grows as fast as it needs to (I'm not saying it won't increase, just not at the rate expected by the data center providers)

    • For early adopters, yes, but many systems have been running as "good enough" without any kind of updates for a long time. For many use cases it needs to get to a point where accuracy is good enough, and then it will be set and forget. I disagree with the approach, but that's what you find in the wild.

    • The best quality you can get is at odds with the best speed you can get. There are lots of people (especially with specific use cases) who will pay for the best speed they can get that is high enough quality.

  • If someone had to bet on an AI crash, which I imagine would lead to unused datacentres and cheap GPUs, how would they invest their winnings to exploit these resources?

    • If the price of inference drops through the floor all the AI wrapper companies become instantly more valuable. Cursor is living on borrowed time because their agents suck and they're coasting on first mover advantage with weak products in general, but their position would get much better with cheap inference.

    • Buy the application layer near winners. When computing costs shrink, usage expands.

  • No. The gains in inference and training efficiency are going to be absorbed by frontier LLM labs being more willing to push more demanding and capable models to the end users, increase reasoning token budgets, etc.

  • For the last 2 years, despite all efficiency gains, I am literally watching characters appear on my screen, as if this was a hacker movie. Lately, I am also waiting for at least 60s for anything to appear at all.

    If that happened at 10x the speed, it would still be slow in computer terms, and that increasingly matters, because I will not be the one reading the stuff – it will be other computers. I think looking back a few years from now, every single piece of silicon that is planned right will look like a laudable but laughable drop in the ocean.

  • The quality that real demand needs is not there yet, so more processing is very probably needed, and efficiency gains may allow for that extra processing.

    (A strong example, read today, of real quality demand: the administration of Albania wants some sort of automated Cabinet Minister. Not just an impartial and incorruptible algorithm (what we normally try to do with deterministic computation): a "minister". Good luck with that.)

Seems impressive. I believe better architectures are really the path forward; I don't think you need more than 100B params, given what this model and GPT-OSS-120B can achieve.

  • We definitely need more parameters; low-param models are hallucination machines, though a low active count is probably fine, assuming the routing is good.

  • New arch seems cool, and it's amazing that we have these published in the open.

    That being said, Qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out of distribution, the models utterly fail to deliver, where the closed models still provide value.

Prediction: AI will become commoditized ~15 IQ points higher than today's state-of-the-art models, and with larger context, within 4 years, as the incremental improvements from training on synthetic data plateau (we've already used all the "real" data out there) and open source models are cheaply trained on the outputs of the big-money models. Then AI development stagnates until someone invents an effective way to use competitive reinforcement learning to train generalized intelligence (similar to how AlphaGo was trained), removing the need for vast quantities of training data. Then we get real AGI.

  • How did we use "all the data"? New knowledge appears on the internet every day, new scientific articles and videos are published.

  • If that's true and if today's frontier models are around 120 IQ (who knows if that is true, but let's run with it, source: https://www.trackingai.org/home) then we'll have an enormous number of ~135 IQ bots with nearly unlimited conscientiousness.

    I can't even begin to understand what that would mean.

Hmm. 80B. These days I am on the lookout for new models in the 32B range, since that is what fits and runs comfortably on my MacBook Pro (M4, 64GB).

I use ollama every day for spam filtering: gemma3:27b works great, but I use gpt-oss:20b on a daily basis because it's so much faster and comparable in performance.

  • The model is 80b parameters, but only 3b are activated during inference. I'm running the old 2507 Qwen3 30B model on my 8gb Nvidia card and get very usable performance.

    • Yes, but you don’t know which 3B parameters you will need, so you have to keep all 80B in your VRAM, or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each next token.
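
      A toy way to see why a whole generation still touches most of the 80B even though each token only activates a few experts (the counts below are made up for illustration, not Qwen3-Next's actual config):

          import random

          num_experts, top_k, tokens = 512, 10, 200    # made-up MoE sizes
          used = set()
          for _ in range(tokens):
              # stand-in for the learned router: each token activates a different top-k subset
              used.update(random.sample(range(num_experts), top_k))
          print(f"experts touched after {tokens} tokens: {len(used)} / {num_experts}")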

> The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507, and shows clear advantages in tasks requiring ultra-long context (up to 256K tokens).

This is pretty impressive and a bit like how the GPT-OSS-120B came out and scored pretty well on the benchmarks despite its somewhat limited size.

That said, using LLMs for software dev use cases, I wouldn't call 256K tokens "ultra-long" context; I regularly go over 100K when working on tasks with bigger scope, e.g.:

  Look at the existing code related to this functionality and the existing design patterns in the code as well as the guidelines.
  Then plan out the implementation in detail and ask me a few questions along the way to figure the details out better.
  Finally, based on everything so far, do the actual implementation.
  Then look it over and tell me if anything has been missed from the plan, then refactor the code in any number of ways.

It could be split up into multiple separate tasks, but I find that the context being more complete (unless the model starts looping garbage, which poisons the context) leads to better results.

My current setup of running Qwen3 Coder 480B on Cerebras bumps into the 131K token limit. If not for the inference speed there (seriously great) and good enough model quality, I'd probably look more in the direction of Gemini or Claude again.

Complete newbie here - some questions, if I may!

This stuff can run on a local machine without internet access, correct?

And it can pretty much match Nano Banana? https://github.com/PicoTrex/Awesome-Nano-Banana-images/blob/...

Also -- what are the specs for a machine to run it (even if slowly!)

  • This model can be run completely offline, yes. You'll need anywhere from 60-200 GB of RAM (either VRAM for high speeds, or a combination of VRAM and RAM, or just CPU+RAM). The active params are really low (3B), so it'll likely run fine even on CPU. You should get 10-15+ t/s even on old DDR4 systems. Offload some experts to a GPU (can be as low as 8-16 GB) and you'll see greater speeds.

    This has nothing to do with nano banana, or image generation. For that you want the qwen image edit[1] models.

    1 - https://huggingface.co/Qwen/Qwen-Image-Edit

  • What you mean is Qwen Image and Qwen Image Edit; you can run those on a local machine, using the Draw Things application for example.

    The model discussed here is a text model, similar to ChatGPT. You can also run it on your local machine, but not yet, as apps need to be updated with Qwen3-Next support (llama.cpp, Ollama, etc.).

  • > This stuff can run on a local machine without internet access, correct?

    Yes.

    > And it can pretty much match Nano Banana?

    No, Qwen3-Next is not a multimodal model, it has no image generation function.

Hyped for the release, but bummed they fell for the ‘next’ naming convention.

What will the actual next advanced release be called:

* next-next

* next (2)

* actual-next-final

I’ve been using gpt-oss-120B with CPU MoE offloading on a 24GB GPU and it’s very usable. Excited to see if I can get good results on this now!

I was getting a bunch of strange hallucinations and weird dialog. It sounds like some exasperated person on the verge of a mental breakdown

> The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507

I'm skeptical about these claims. How can this be? Wouldn't there be massive loss of world knowledge? I'm particularly skeptical because a recent trend in Q2 2025 has been benchmaxxing.

  • > I'm skeptical about these claims. How can this be?

    More efficient architecture.

    > Wouldn't there be massive loss of world knowledge?

    If you assume equally efficient architecture and no other salient differences, yes, that’s what you’d expect from a smaller model.

      • Hmm. Let's just say that if this is true - that this is actually better with such a much lower total parameter count - it's the greatest accomplishment in over a year of LLM development. With the backdrop of benchmaxxing in 2025, I'll believe it when I see the results on closed benchmarks and SimpleBench. My concern is this might be a hallucination machine.

How does the context length scaling at 256K tokens compare to Llama's 1M in terms of performance? How are the contexts treated differently?

> "The content loading failed."

It's amazing how far and how short we've come with software architectures.

How much VRAM does it require?

  • A good rule of thumb is to think of one param as one unit of storage. The "default" unit of storage these days is bf16 (i.e. 16 bits for 1 weight). So for an 80B model that'll be ~160GB of weights. Then you have quantisation, usually in 8-bit and 4-bit. That means each weight is "stored" in 8 bits or 4 bits. So for an 80B model that'll be ~80GB in fp8 and ~40GB in fp4/int4.

    But in practice you need a bit more than that. You also need some space for context, and then for kv cache, potentially a model graph, etc.

    So you'll see in practice that you need 20-50% more RAM than this rule of thumb.

    For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM. But it also depends how you run it. With MoE models, you can selectively load some experts (parts of the model) in VRAM, while offloading some in RAM. Or you could run it fully on CPU+RAM, since the active parameters are low - 3B. This should work pretty well even on older systems (DDR4).
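
    A quick calculator version of that rule of thumb (my own sketch; the 20-50% overhead range is the rough figure above, not an exact number, and it ignores how KV cache grows with context):

        def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
            """Weights only: parameter count times bytes per weight."""
            return params_billion * (bits_per_weight / 8)

        for bits in (16, 8, 4):
            base = weight_memory_gb(80, bits)   # 80B total parameters
            print(f"{bits}-bit: ~{base:.0f} GB weights, ~{base * 1.2:.0f}-{base * 1.5:.0f} GB with overhead")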

    • Can you explain how context fits into this picture, by any chance? I sort of understand the VRAM requirement for the model itself, but it seems like larger context windows increase the RAM requirement by a lot more?

  • That's not a meaningful question. Models can be quantized to fit into much smaller memory requirements, and not all MoE layers (in MoE models) have to be offloaded to VRAM to maintain performance.

Would be interesting to see how it compares to gpt-oss-120b. The latter also runs very fast, and its pricing is currently much better than Qwen3-Next's on many providers. I would expect that if this model is as fast as claimed, pricing should be similar or even lower.

For a model that can run offline, they've nailed how the website can too.

And it appears like it's thinking about it! /s

All these new datacenters are going to be a huge sunk cost. Why would you pay OpenAI when you can host your own hyper-efficient Chinese model for like 90% less cost at 90% of the performance? And that is compared to today's subsidized pricing, which they can't keep up forever.

  • Eventually Nvidia or a shrewd competitor will release 64/128gb consumer cards; locally hosted GPT 3.5+ is right around the corner, we're just waiting for consumer hardware to catch up at this point.

  • >to today's subsidized pricing, which they can't keep up forever.

    The APIs are not subsidized, they probably have quite the large margin actually: https://lmsys.org/blog/2025-05-05-large-scale-ep/

    >Why would you pay OpenAI when you can host your own hyper efficient Chinese model

    The 48GB of VRAM or unified memory required to run this model at 4bits is not free either.

    • I didn't say it's free, but it is about 90% cheaper. Sonnet is $15 per million output tokens; this just dropped and is available on OpenRouter at $1.40. Even Gemini Flash, which is probably the best price-to-performance API, is generally ranked lower than Qwen's models and is $2.50, so this is still 44% cheaper.
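
      Quick check of those percentages (prices per million output tokens, as quoted above):

          sonnet, qwen_next, gemini_flash = 15.00, 1.40, 2.50
          print(1 - qwen_next / sonnet)        # ~0.91 -> roughly 90% cheaper than Sonnet
          print(1 - qwen_next / gemini_flash)  # 0.44  -> 44% cheaper than Gemini Flash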