Qwen3-Next

18 hours ago (qwen.ai)

Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.

Deepseek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...

But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are each [129280, 7168], or about 2GB combined at FP8.

Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...

So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that helps significantly speed up inference.
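
Rough back-of-envelope check of the sizes above (my own arithmetic, not a figure from the model cards), as a quick Python sketch:

    vocab, hidden = 129_280, 7_168
    params_per_tensor = vocab * hidden                     # ~0.93B parameters each
    bytes_per_param_fp8 = 1                                # FP8 = 1 byte per parameter
    extra = 2 * params_per_tensor * bytes_per_param_fp8    # embed_tokens + shared_head.head
    print(extra / 1e9)                                     # ~1.85 GB of extra weights for MTP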

  • What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

    • Speculative decoding! It makes inference a LOT faster.

      Instead of generating tokens one at a time, you also generate a draft of the second token, and then use speculative decoding on that second token (instead of having it be produced by a separate draft model like Qwen 0.6b). If the draft checks out as correct, the 2nd token gets generated MUCH faster.

      If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
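
      A minimal sketch of that accept/reject loop, assuming a hypothetical model object with a main head and an MTP head (the method names are made up for illustration, not vLLM's or Qwen's actual API):

          def decode_step(model, ctx):
              nxt, draft = model.predict_with_mtp(ctx)  # main head: next token; MTP head: guess for the one after
              ctx.append(nxt)
              verified = model.predict(ctx)             # one ordinary forward pass to check the draft
              ctx.append(draft if verified == draft else verified)
              return ctx                                # accepted draft = 2 tokens for ~1 extra pass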

    • > What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

      It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

    • It could be a better draft model than separately trained EAGLE etc for speculative decoding.

  • How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding - meaning if I use this model in vLLM, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding?

  • Could someone kindly point to a convenient all-on-one ELI5 of all these words? :')

    • Unfortunately, no. The industry is moving super quickly and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best approach is to take some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/ChatGPT/Gemini, and then, as you read something like this, search up any acronym you don't know (in this case: MTP = Multi-Token Prediction) and see if you have the basis for understanding it. If not, read up on the precursors.

      Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available don't describe what's going on well enough to reason about and understand. So, you just have to start reading!

    • Background:

      LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to a 1D list at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary, i.e. f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where the bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard).

      ---

      When training, you make an LLM learn that

      I use arch = downscale(upscale(I use))

      If you want to predict the next word after that, you then do the following in sequence:

      I use arch btw = downscale(upscale(I use arch))

      Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.

      i.e in parallel:

      I use arch = downscale1(upscale(I use))

      I use ____ btw = downscale2(upscale(I use))

      However, this way you'll need twice the number of parameters that downscale needs. And if you want to predict more tokens ahead, you'll need even more parameters.

      What Qwen has done is, instead of making downscale1 and downscale2 completely separately parameterized functions, set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common, and that the difference between predicting the next and the next-to-next token can be captured in one lightweight function each. Lightweight here means fewer parameters. The bet paid off.

      So overall, you save params.

      Concretely,

      Before: downscale1.params + downscale2.params

      After: downscale_common.params + lightweight1.params + lightweight2.params

      Edit: it's actually downscale_common(lightweight()) and not the other way around as I have written above. It doesn't change the crux of the answer, but I'm including this for clarity.
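
      A toy PyTorch sketch of that factoring, using the corrected composition from the edit above (sizes and names are made up for illustration; this is not Qwen's actual module layout):

          import torch.nn as nn

          hidden, vocab = 1024, 32000

          upscale          = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())  # stand-in for the big shared trunk
          downscale_common = nn.Linear(hidden, vocab)    # shared un-embedding, paid for once
          lightweight1     = nn.Linear(hidden, hidden)   # small head for the next token
          lightweight2     = nn.Linear(hidden, hidden)   # small head for the token after that

          def predict_two(x):                            # x: [batch, hidden]
              h = upscale(x)
              logits_1 = downscale_common(lightweight1(h))   # next-token scores
              logits_2 = downscale_common(lightweight2(h))   # next-next-token scores (MTP head)
              return logits_1, logits_2

          # Param saving: one [hidden, vocab] matrix instead of two, plus two small [hidden, hidden] heads.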

    • For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask to explain it on my level and then I can ask questions for clarification.

    • The following was generated by chatG5:

          Qwen3-Next — A family of large language models from Qwen (Alibaba).  
          DeepSeek R1 — Another large open-source language model from DeepSeek AI.  
          Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.  
          MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.  
          Embedding — Converts words/tokens into vectors (numbers) the model can work with.  
          Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.  
          embed_tokens — The big lookup table of embeddings (token → vector).  
          shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.  
          [129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).  
          FP8 — Floating-point format using 8 bits (compact, faster, less precise).  
          Active parameters — The weights that actually need to be loaded in GPU memory to run the model.  
          Inference — Running the model to generate text (as opposed to training it).  
          GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.

Alibaba keeps releasing gold content

I just tried Qwen3-Next-80B-A3B on Qwen Chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis.

According to Qwen Chat, Qwen3-Next has the following limits:

Maximum context length: 262,144 tokens

Max summary generation length: 32,768 tokens

This is 2x higher on context length and 4x higher on summary generation compared to Qwen3-235B-A22B, damn

> Qwen3-Next [...] excels in ultra-long-context understanding and complex tasks

Even though their new hybrid architecture is fascinating, I think I'll continue to stick with Qwen2.5-Turbo because it's one of the few models that supports 1M tokens of context length. My use case is uploading large PDFs and asking questions across chapters.

  • My take on long context for many frontier models: it's not about support, but that accuracy drops drastically as you increase the context. Even if a model claims to support 10M context, the reality is it doesn't perform well when you saturate it. Curious to hear others' perspectives on this.

    • Agreed. That said, in general a 1M context model has a larger usable window than a 260k context model.

    • This is my experience with Gemini. Yes, I really can put an entire codebase and all the docs and pre-dev discussions and all the inter-engineer chat logs in there.

      I still see the model becoming more intoxicated as turn count gets high.

  • If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

    > Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

    Source: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct#proc...
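
    For reference, a rough sketch of what loading with a YaRN override can look like via transformers (the rope_scaling values below are illustrative assumptions; check the linked model card for the exact settings Qwen recommends for this model):

        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen3-Next-80B-A3B-Instruct",
            rope_scaling={
                "rope_type": "yarn",
                "factor": 4.0,                               # 262,144 x 4 ~= 1M tokens
                "original_max_position_embeddings": 262144,  # native context length
            },
        )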

    • > If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

      I read the article, but as I said Qwen chat only provides up to 262k tokens in context length, so I'll stick with Qwen2.5 Turbo which supports 1M tokens.

      I am not in a position where I can self-host yet

  • Their proprietary models are very good too and go under the radar; they never seem to appear on any benchmarks. Qwen3-coder-plus is significantly better than their open source Qwen3, and Qwen3-Max also rivals the SOTA models.

  • How are you prepping the PDF data before shoving it into Qwen?

    • I just compress the file size as low as possible without losing quality; I didn't even know there were more ways to prep it.

      I do sometimes chop up the PDF into smaller PDFs, each with its own individual chapter.

    • Not OP, but we use the docling library to extract text and put it in markdown before storing for use with an LLM.

llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"

Here's a classic ASCII art representation of SpongeBob SquarePants:

     .------.
    /  o o  \
   |         |
   |  \___/  |
    \_______/


  llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  "An ASCII of spongebob"

  Here's an ASCII art of SpongeBob SquarePants:

      .--..--..--..--..--..--.
    .' \  (`._   (_)     _   \
  .'    |  '._)         (_)  |
  \ _.')\      .----..--.'  /
  |(_.'  |    /    .-\-.  \---.
  \     0|    |   ( O| O) |    |
   |  _  |  .--.____.'._.-.  |
   \ (_) | o         -` .-`  |
    |    \   |`-._ _ _ _ _\ /
    \    |   |  `. |_||_|   |
     | o |    \_       \    |     -.   .-.
     |.-.  \    `--..-'   O |     `.`-' .'
   _.'  .' |     `-.-'      /-.__   ' .-'
  .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
  `-._  `.  |________/\_____|    `-.'
     .'   ).| '=' '='\/ '=' |
     `._.`  '---------------'
             //___\   //___\
               ||       ||
               ||_.-.   ||_.-.
              (_.--__) (_.--__)


Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end - leaving the legs off. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?

Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.

  • memorized: https://www.asciiart.eu/cartoons/spongebob-squarepants

    • Naturally. That's how LLMs work. During training you measure the loss, the difference between the model output and the ground-truth and try to minimize it. We prize models for their ability to learn. Here we can see that the large model does a great job at learning to draw bob, while the small model performs poorly.
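
      A toy illustration of the "measure the loss and minimize it" step (made-up numbers, not a real model):

          import math

          p_true_token = 0.20                  # model's current probability for the ground-truth next character
          print(-math.log(p_true_token))       # ~1.61 -- training nudges weights so this shrinks
          p_after_training = 0.95
          print(-math.log(p_after_training))   # ~0.05 -- the ASCII sequence is now effectively memorized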

    • For the model to have memorized the entire sequence of characters precisely, this must appear hundreds of times in the training data?

  • I think there is some distillation relationship between Kimi K2 and Qwen Coder or other related models, or the same training data. I tried most LLMs; only Kimi K2 gave the exact same ASCII. Kimi K2: Here's a classic ASCII art of SpongeBob SquarePants for you:

               .--..--..--..--..--..--.
            .' \  (`._   (_)     _   \
          .'    |  '._)         (_)  |
          \ _.')\      .----..---.   /
          |(_.'  |    /    .-\-.  \  |
          \     0|    |   ( O| O) | o|
           |  _  |  .--.____.'._.-.  |
           \ (_) | o         -` .-`  |
            |    \   |`-._ _ _ _ _\ /
            \    |   |  `. |_||_|   |
            | o  |    \_      \     |     -.   .-.
            |.-.  \     `--..-'   O |     `.`-' .'
          _.'  .' |     `-.-'      /-.__   ' .-'
        .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
        `-._  `.  |________/\_____|    `-.'
           .'   ).| '=' '='\/ '=' |
           `._.`  '---------------'
                   //___\   //___\
                     ||       ||
                     ||_.-.   ||_.-.
                    (_.--__) (_.--__)

    Enjoy your SpongeBob ASCII!

    • For ASCII art to look right and not get messed up, the generator has to know the width of the div in characters, e.g. 80, 240, etc., so it can make sure the lines don't wrap. So how does an LLM know anything about the UI it's serving? Is it just luck? What if you ask it to draw something that's like 16:9 in aspect ratio... would it know to scale it down so lines won't wrap? How about loss of detail if it does? Also, is it as good with Unicode art? So many questions.

  • I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.

    With Qwen3's sparse MoE, though, the path to that memory is noisier: two extra stochastic draws -- (a) which expert(s) fire, and (b) which token gets sampled from them. Add the new gated attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.

    Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models:

    Model: chutes/Qwen/Qwen3-235B-A22B

       /~\       
      (  *  *  )  
      (  o o o  )  
       \  -  /     
        \  /\  /   
         \  /      
          \/       
         /|||\      
        /|||||\     
       /||||||||\   
      ( o   o   o ) 
       \   W   /   
        \___/
        

    Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507

        /\_/\  
       ( o.o ) 
        > ^ <  
       /     \ 
      |       |
      |       |
       \     /
        '-'-'
    

    Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507

      .-----------.
      | []     [] |
      |           |
      |   __  __  |
      |  |  ||  | |
      |  |__||__| |
      |           |
      '-----------'
    
    

    Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct

            __
           /  \
          /    \
         /      \
        /        \
       /__________\
      |  o      o  |
      |     __     |
      |  \____/    |
      |            |
      |  ________  |
      |  \      /  |
      |   \____/   |
      |            |
      |____________|
      
         ___________
        /           \
       /             \
      |   _______     |
      |  |       |    |
      |  |  ___  |    |
      |  | |   | |    |
      |  | |___| |    |
      |  |_______|    |
      |               |
      |_______________|
    
    

    Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking

      .-.
       /   \
      |  o o|
      |  >  |
      |  ---|
       \___/
      
    

    Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507

        _________________________
       /                         \
      |   _     _     _     _   |
      |  / \   / \   / \   / \  |
      | |   | |   | |   | |   | |
      |  \_/   \_/   \_/   \_/  |
      |                         |
      |   _     _     _     _   |
      |  / \   / \   / \   / \  |
      | |   | |   | |   | |   | |
      |  \_/   \_/   \_/   \_/  |
      |                         |
      |    SpongeBob SquarePants   |
      |_________________________|

The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model, depending on how you offload between VRAM and CPU. That’s insane.

  • Qwen isn't directing the forward progress of LLMs. SOTA LLMs have been MoE since GPT-4. The OG 4.

    Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.

    • I would venture to suggest that reading it as "Qwen made MoEs in toto || first || better than anyone else" is reductive - merely, the number of experts and the ratios here are quite novel (80B... inferencing only 3B!?!). I sometimes kick around the same take, but thought I'd stand up for this. And I know what I'm talking about; I maintain a client that wraps llama.cpp x ~20 models on inference APIs.

  • In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms compared to models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.

Added Qwen3 Next to the Brokk Power Ranking Open Round (coding benchmark). It's roughly GPT-OSS-20b strength.

Full set of open weight model results: https://brokk.ai/power-ranking?version=openround&models=ds-r...

  • This would be a valuable benchmark if it included languages other than Java, and let me see which models are best at the languages I work with.

    My real-world usage does not line up with these results, but I'm not working with Java.

The same week, Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true, then this could lead to a lot less demand for Nvidia, Oracle, Coreweave, etc.

  • https://en.wikipedia.org/wiki/Jevons_paradox

    • Sure but where is the demand going to come from? LLMs are already in every google search, in Whatsapp/Messenger, throughout Google workspace, Notion, Slack, etc. ChatGPT already has a billion users.

      Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.

  • I'm not going to speculate about what might be ahead in regards to Oracle's forecasting of data center demand, but regarding the idea of efficiency gains leading to lower demand, don't you think something like Jevons paradox might apply here?

  • People said the same thing for deepseek-r1, and nothing changed.

    If you come up with a way to make the current generation of models 10x more efficient, then everyone just moves to train a 10x bigger model. There isn't a model size the players are going to be satisfied with and not go 10x bigger. Not as long as scaling still pays off (and it does today).

  • Absolutely not; the trends have proven that people will just pay for the best quality they can get, and keep paying roughly the same money.

    Every time a new model is released, people abandon the old, lower quality model (even when it’s priced less), and instead prefer to pay the same for a better model.

    The same will happen with this.

    • Sure but the money people are paying right now isn't that much in the grand scheme of things. OpenAI is expecting 13bn in revenue this year. AWS made over 100bn last year. So unless they pay a lot more, or they find customers outside of programmers, designers, etc who are willing to pay for the best quality, I don't see how it grows as fast as it needs to (I'm not saying it won't increase, just not at the rate expected by the data center providers)

    • For early adopters, yes, but many systems have been running as "good enough" without any kind of updates for a long time. For many use cases it needs to get to a point where accuracy is good enough, and then it will be set and forget. I disagree with the approach, but that's what you find in the wild.

    • The best quality you can get is at odds with the best speed you can get. There are lots of people (especially with specific use cases) who will pay for the best speed they can get that is high enough quality.

  • If someone had to bet on an AI crash, which I imagine would lead to unused datacentres and cheap GPUs, how would they invest their winnings to exploit these resources?

    • If the price of inference drops through the floor all the AI wrapper companies become instantly more valuable. Cursor is living on borrowed time because their agents suck and they're coasting on first mover advantage with weak products in general, but their position would get much better with cheap inference.

    • Buy the application layer near winners. When computing costs shrink, usage expands.

  • No. The gains in inference and training efficiency are going to be absorbed by frontier LLM labs being more willing to push more demanding and capable models to the end users, increase reasoning token budgets, etc.

  • For the last 2 years, despite all efficiency gains, I am literally watching characters appear on my screen, as if this was a hacker movie. Lately, I am also waiting for at least 60s for anything to appear at all.

    If that happened at 10x the speed, it would still be slow in computer terms, and that increasingly matters, because I will not be the one reading the stuff – it will be other computers. I think looking back a few years from now, every single piece of silicon that is planned right will look like a laudable but laughable drop in the ocean.

  • The quality that real demand needs is not there yet, so more processing is very probably needed, and efficiency gains may allow for that extra processing.

    (A strong example, read today, of real quality demand: the administration of Albania wants some sort of automated Cabinet Minister. Not just an impartial and incorruptible algorithm (what we normally try to do with deterministic computation): a "minister". Good luck with that.)

Seems impressive. I believe better architectures are really the path forward; I don't think you need more than 100B params, given what this model and GPT-OSS-120B can achieve.

  • We definitely need more parameters; low-param models are hallucination machines, though a low active count is probably fine, assuming the routing is good.

  • New arch seems cool, and it's amazing that we have these published in the open.

    That being said, Qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out of distribution, the models utterly fail to deliver, where the closed models still provide value.

Prediction: AI will become commoditized ~15 IQ points higher than today's state-of-the-art models, and with larger context, within 4 years, as the incremental improvements from training on synthetic data plateau (we've already used all the "real" data out there) and open source models are cheaply trained on the outputs of the big-money models. Then AI development stagnates until someone invents an effective way to use competitive reinforcement learning to train generalized intelligence (similar to how AlphaGo was trained), removing the need for vast quantities of training data. Then we get real AGI.

  • How did we use "all the data"? New knowledge appears on the internet every day, new scientific articles and videos are published.

  • If that's true and if today's frontier models are around 120 IQ (who knows if that is true, but let's run with it, source: https://www.trackingai.org/home) then we'll have an enormous number of ~135 IQ bots with nearly unlimited conscientiousness.

    I can't even begin to understand what that would mean.

Hmm. 80B. These days I am on the lookout for new models in the 32B range, since that is what fits and runs comfortably on my MacBook Pro (M4, 64GB).

I use ollama every day for spam filtering: gemma3:27b works great, but I use gpt-oss:20b on a daily basis because it's so much faster and comparable in performance.

  • The model is 80b parameters, but only 3b are activated during inference. I'm running the old 2507 Qwen3 30B model on my 8gb Nvidia card and get very usable performance.

    • Yes, but you don’t know which 3B parameters you will need, so you have to keep all 80B in your VRAM, or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each next token.
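
      A toy way to see why a whole generation still touches most of the 80B even though each token only activates a few experts (the counts below are made up for illustration, not Qwen3-Next's actual config):

          import random

          num_experts, top_k, tokens = 512, 10, 200    # made-up MoE sizes
          used = set()
          for _ in range(tokens):
              # stand-in for the learned router: each token activates a different top-k subset
              used.update(random.sample(range(num_experts), top_k))
          print(f"experts touched after {tokens} tokens: {len(used)} / {num_experts}")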

> The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507, and shows clear advantages in tasks requiring ultra-long context (up to 256K tokens).

This is pretty impressive and a bit like how the GPT-OSS-120B came out and scored pretty well on the benchmarks despite its somewhat limited size.

That said, using LLMs for software dev use cases, I wouldn't call 256K tokens "ultra-long" context; I regularly go over 100K when working on tasks with bigger scope, e.g.:

  Look at the existing code related to this functionality and the existing design patterns in the code as well as the guidelines.
  Then plan out the implementation in detail and ask me a few questions along the way to figure the details out better.
  Finally, based on everything so far, do the actual implementation.
  Then look it over and tell me if anything has been missed from the plan, then refactor the code in any number of ways.

It could be split up into multiple separate tasks, but I find that the context being more complete (unless the model starts looping garbage, which poisons the context) leads to better results.

My current setup of running Qwen3 Coder 480B on Cerebras bumps into the 131K token limit. If not for the inference speed there (seriously great) and good enough model quality, I'd probably look more in the direction of Gemini or Claude again.

Complete newbie here - some questions, if I may!

This stuff can run on a local machine without internet access, correct?

And it can pretty much match Nano Banana? https://github.com/PicoTrex/Awesome-Nano-Banana-images/blob/...

Also -- what are the specs for a machine to run it (even if slowly!)

  • This model can be run completely offline, yes. You'll need anywhere from 60-200 GB of RAM (either VRAM for high speeds, or a combination of VRAM and RAM, or just CPU+RAM). The active params are really low (3B), so it'll likely run fine even on CPU. You should get 10-15+ t/s even on old DDR4 systems. Offload some experts to a GPU (can be as low as 8-16 GB) and you'll see greater speeds.

    This has nothing to do with nano banana, or image generation. For that you want the qwen image edit[1] models.

    1 - https://huggingface.co/Qwen/Qwen-Image-Edit

  • What you mean is Qwen Image and Qwen Image Edit; you can run those on a local machine, using the Draw Things application for example.

    The model discussed here is a text model, similar to ChatGPT. You can also run it on your local machine, but not yet, as apps need to be updated with Qwen3-Next support (llama.cpp, Ollama, etc.).

  • > This stuff can run on a local machine without internet access, correct?

    Yes.

    > And it can pretty much match Nano Banana?

    No, Qwen3-Next is not a multimodal model, it has no image generation function.

Hyped for the release, but bummed they fell for the ‘next’ naming convention.

What will the actual next advanced release be called:

* next-next

* next (2)

* actual-next-final

I’ve been using gpt-oss-120B with CPU MoE offloading on a 24GB GPU and it’s very usable. Excited to see if I can get good results on this now!

I was getting a bunch of strange hallucinations and weird dialog. It sounds like some exasperated person on the verge of a mental breakdown

> The Qwen3-Next-80B-A3B-Instruct performs comparably to our flagship model Qwen3-235B-A22B-Instruct-2507

I'm skeptical about these claims. How can this be? Wouldn't there be massive loss of world knowledge? I'm particularly skeptical because a recent trend in Q2 2025 has been benchmaxxing.

  • > I'm skeptical about these claims. How can this be?

    More efficient architecture.

    > Wouldn't there be massive loss of world knowledge?

    If you assume equally efficient architecture and no other salient differences, yes, that’s what you’d expect from a smaller model.

      • Hmm. Let's just say that if this is true - that this is actually better with such a much lower total parameter count - it's the greatest accomplishment in over a year of LLM development. With the backdrop of benchmaxxing in 2025, I'll believe it when I see the results on closed benchmarks and SimpleBench. My concern is this might be a hallucination machine.

How does the context length scaling at 256K tokens compare to Llama's 1M in terms of performance? How are the contexts treated differently?

> "The content loading failed."

It's amazing how far and how short we've come with software architectures.

How much VRAM does it require?

  • A good rule of thumb is to think of one param as one unit of storage. The "default" unit of storage these days is bf16 (i.e. 16 bits for 1 weight). So for an 80B model that'll be ~160GB of weights. Then you have quantisation, usually in 8-bit and 4-bit. That means each weight is "stored" in 8 bits or 4 bits. So for an 80B model that'll be ~80GB in fp8 and ~40GB in fp4/int4.

    But in practice you need a bit more than that. You also need some space for context, and then for kv cache, potentially a model graph, etc.

    So you'll see in practice that you need 20-50% more RAM than this rule of thumb.

    For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM. But it also depends how you run it. With MoE models, you can selectively load some experts (parts of the model) in VRAM, while offloading some in RAM. Or you could run it fully on CPU+RAM, since the active parameters are low - 3B. This should work pretty well even on older systems (DDR4).
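
    A quick calculator version of that rule of thumb (my own sketch; the 20-50% overhead range is the rough figure above, not an exact number, and it ignores how KV cache grows with context):

        def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
            """Weights only: parameter count times bytes per weight."""
            return params_billion * (bits_per_weight / 8)

        for bits in (16, 8, 4):
            base = weight_memory_gb(80, bits)   # 80B total parameters
            print(f"{bits}-bit: ~{base:.0f} GB weights, ~{base * 1.2:.0f}-{base * 1.5:.0f} GB with overhead")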

    • Can you explain how context fits into this picture, by any chance? I sort of understand the VRAM requirement for the model itself, but it seems like larger context windows increase the RAM requirement by a lot more?

  • That's not a meaningful question. Models can be quantized to fit into much smaller memory requirements, and not all MoE layers (in MoE models) have to be offloaded to VRAM to maintain performance.

Would be interesting to see how it compares to gpt-oss-120b. The latter also runs very fast, and its pricing is currently much better than Qwen3-Next's on many providers. I would expect that if this model is as fast as claimed, pricing should be similar or even lower.

For a model that can run offline, they've nailed how the website can too.

And it appears like it's thinking about it! /s

All these new datacenters are going to be a huge sunk cost. Why would you pay OpenAI when you can host your own hyper-efficient Chinese model for like 90% less cost at 90% of the performance? And that is compared to today's subsidized pricing, which they can't keep up forever.

  • Eventually Nvidia or a shrewd competitor will release 64/128gb consumer cards; locally hosted GPT 3.5+ is right around the corner, we're just waiting for consumer hardware to catch up at this point.

  • >to today's subsidized pricing, which they can't keep up forever.

    The APIs are not subsidized, they probably have quite the large margin actually: https://lmsys.org/blog/2025-05-05-large-scale-ep/

    >Why would you pay OpenAI when you can host your own hyper efficient Chinese model

    The 48GB of VRAM or unified memory required to run this model at 4bits is not free either.

    • I didn't say it's free, but it is about 90% cheaper. Sonnet is $15 per million output tokens; this just dropped and is available on OpenRouter at $1.40. Even Gemini Flash, which is probably the best price-to-performance API, is generally ranked lower than Qwen's models and is $2.50, so this is still 44% cheaper.
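
      Quick check of those percentages (prices per million output tokens, as quoted above):

          sonnet, qwen_next, gemini_flash = 15.00, 1.40, 2.50
          print(1 - qwen_next / sonnet)        # ~0.91 -> roughly 90% cheaper than Sonnet
          print(1 - qwen_next / gemini_flash)  # 0.44  -> 44% cheaper than Gemini Flash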