Comment by jychang

1 day ago

Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.

DeepSeek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...

But DeepSeek R1 adds embed_tokens and shared_head.head tensors, each of shape [129280, 7168]; together that's about 2GB at FP8.

Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...

So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that significantly speeds up inference.
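
Back-of-the-envelope, assuming 1 byte per weight at FP8 and that both extra tensors have that [129280, 7168] shape (my reading of the config):

    vocab, hidden = 129280, 7168       # DeepSeek R1's vocab size x hidden size
    fp8_bytes = 1                      # one byte per weight at FP8

    embed_tokens = vocab * hidden * fp8_bytes    # extra input embedding in the MTP layer
    shared_head  = vocab * hidden * fp8_bytes    # extra un-embedding (shared_head.head)

    print(f"{(embed_tokens + shared_head) / 1e9:.2f} GB")   # ~1.85 GB that Qwen3-Next simply doesn't duplicate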

What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

  • Speculative decoding! It makes inference a LOT faster.

    Instead of generating tokens one at a time, the model drafts the second token as well, and then speculative decoding is used on that draft (instead of having it produced by a separate draft model like Qwen 0.6B). If the draft is checked and is correct, then the 2nd token gets produced MUCH faster.

    If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
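
    Toy version of that accept/reject loop (the two model functions are stand-ins I made up, not a real API; in the real thing the check rides along on a forward pass that's much cheaper than a fresh decode step):

        import random

        VOCAB = 1000

        def main_next(seq):
            # stand-in for the full model's next-token choice
            return (sum(seq) * 31 + len(seq)) % VOCAB

        def mtp_draft(seq, t1):
            # stand-in for the cheap MTP head: usually agrees with the full model
            guess = main_next(seq + [t1])
            return guess if random.random() < 0.8 else (guess + 1) % VOCAB

        def generate(seq, n_new):
            made = 0
            while made < n_new:
                t1 = main_next(seq)            # token t+1, generated the normal (slow) way
                d2 = mtp_draft(seq, t1)        # MTP head's guess for token t+2
                seq = seq + [t1]; made += 1
                # verify the draft: does the full model agree at that position?
                if made < n_new and main_next(seq) == d2:
                    seq = seq + [d2]; made += 1    # accepted: ~2 tokens per full step
                # else: rejected, the next iteration regenerates it the slow way
            return seq

        print(generate([1, 2, 3], 10))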

    • Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea?

      I’m not an expert on LLMs, just a user.

      5 replies →

    • Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions?

      8 replies →

  • It could be a better draft model for speculative decoding than a separately trained EAGLE head, etc.

  • > What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

    It is only useful for inference and doesn't help with pretraining, which actually points to speculative decoding not being sufficiently general: the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding, meaning if I use this model in vllm, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding?

Could someone kindly point to a convenient all-on-one ELI5 of all these words? :')

  • Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is take some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/Chat/Gemini, and then, as you read something like this, look up any acronym you don't know (in this case, MTP = Multi-Token Prediction) and see if you have the basis for understanding what it's about. If not, read up on the precursors.

    Unlike many disciplines, AI is an arena that doesn't have a lot of simplified models that are both intuitive and accurate -- most of the simplified models available don't describe what's going on well enough to reason about and understand it. So, you just have to start reading!

    • > Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate.

      I don't think it moves this fast.

      I mean, there are very few fundamental differences between GPT-2 and gpt-oss-120b; it's mostly incremental improvements that don't change the full picture much (a variation on the attention architecture and masking, a different activation function, a different positional encoding, and swapping the MLP layers for a sparse "mixture of experts"). At the end of the day, from Mistral to DeepSeek, through Llama and Qwen3, it's always the same stack of transformer layers with slight variations between any two architectures.

      This Qwen3-Next is special though, as it's the first time a major player has released something this different (lesser players have made hybrid-architecture LLMs for the past two years, but when it comes to language models, IBM really isn't comparable to Alibaba). This is what I expected Llama 4 to be.

  • Background:

    LLMs take your input, upscale it into a very high-dimensional space, and then downscale it back to a 1D list at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary. i.e. f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params).

    I see you have a gamedev background, so as an example: Bezier curves are parameterized functions where the Bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to Bezier curves in this regard).
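
    A toy, runnable version of f(x) = downscale(upscale(x)), with tiny made-up sizes (a real model also runs many transformer layers in between, which I'm skipping here):

        import numpy as np

        vocab, hidden = 1000, 64                  # toy sizes; real models are ~150k x several thousand
        upscale   = np.random.randn(vocab, hidden) * 0.02   # embedding table (parameters)
        downscale = np.random.randn(hidden, vocab) * 0.02   # un-embedding (parameters)

        token_id = 42                             # some input token
        h = upscale[token_id]                     # one token -> a 64-dim vector
        logits = h @ downscale                    # 64 dims -> one score per word in the vocabulary
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        next_token = int(probs.argmax())          # the model's pick for the next word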

    ---

    When training, you make an LLM learn that

    I use arch = downscale(upscale(I use))

    If you want to predict the next word after that, you then do the following in sequence:

    I use arch btw = downscale(upscale(I use arch))

    Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.

    i.e in parallel:

    I use arch = downscale1(upscale(I use))

    I use ____ btw = downscale2(upscale(I use))

    However, this way you'll need twice as many parameters as a single downscale. And if you want to predict more tokens ahead, you'll need even more parameters.

    What Qwen has done: instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common, and that the difference between predicting the next and the next-to-next token can be captured in one lightweight function each. Lightweight here means fewer parameters. The bet paid off.

    So overall, you save params.

    Concretely,

    Before: downscale1.params + downscale2.params

    After: downscale_common.params + lightweight1.params + lightweight2.params

    Edit: it's actually downscale_common(lightweight()) and not the other way around as I wrote above. This doesn't change the crux of the answer, but I'm including it for clarity.
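
    Here's a minimal numeric sketch of that sharing, reusing the toy sizes from above (the names and shapes are mine for illustration, not Qwen's actual modules):

        import numpy as np

        vocab, hidden = 1000, 64

        # naive MTP: one full un-embedding matrix per predicted position
        downscale1 = np.random.randn(hidden, vocab)
        downscale2 = np.random.randn(hidden, vocab)
        naive_params = downscale1.size + downscale2.size                  # 128,000

        # shared scheme: one big un-embedding plus a small adapter per position,
        # i.e. logits_k = downscale_common(lightweight_k(h))
        lightweight1     = np.random.randn(hidden, hidden)
        lightweight2     = np.random.randn(hidden, hidden)
        downscale_common = np.random.randn(hidden, vocab)
        shared_params = (lightweight1.size + lightweight2.size
                         + downscale_common.size)                         # 72,192

        h = np.random.randn(hidden)                         # hidden state at the current position
        logits_t1 = (h @ lightweight1) @ downscale_common   # scores for the next token
        logits_t2 = (h @ lightweight2) @ downscale_common   # scores for the next-to-next token

        print(naive_params, shared_params)   # the big [hidden, vocab] matrix isn't duplicated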

    • so after your edit it would be (just to clarify):

          I use ____ ___ = downscale_common(lightweight1(.)) + downscale_common(lightweight2(.)) ?
      

      And does it generate 2 at a time and keep going that way, or is there some overlap?

      1 reply →

    • Dude, this was like that woosh of cool air on your brain when an axe splits your head in half. That really brought a lot of stuff into focus.

  • For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask it to explain things at my level, and then I can ask questions for clarification.

  • The following was generated by chatG5:

        Qwen3-Next — A family of large language models from Qwen (Alibaba).  
        DeepSeek R1 — Another large open-source language model from DeepSeek AI.  
        Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.  
        MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.  
        Embedding — Converts words/tokens into vectors (numbers) the model can work with.  
        Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.  
        embed_tokens — The big lookup table of embeddings (token → vector).  
        shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.  
        [129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).  
        FP8 — Floating-point format using 8 bits (compact, faster, less precise).  
        Active parameters — The weights actually used to process each token (for mixture-of-experts models, much smaller than the total parameter count).  
        Inference — Running the model to generate text (as opposed to training it).  
        GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.