Qwen3-Next — A family of large language models from Qwen (Alibaba).
DeepSeek R1 — Another large open-source language model from DeepSeek AI.
Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.
MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.
Embedding — Converts words/tokens into vectors (numbers) the model can work with.
Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.
embed_tokens — The big lookup table of embeddings (token → vector).
shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.
[129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension); see the sketch after this list.
FP8 — Floating-point format using 8 bits (compact, faster, less precise).
Active parameters — The weights actually used when processing a given token (in a sparse mixture-of-experts model, far fewer than the total number stored); a rough proxy for the compute needed per token at inference.
Inference — Running the model to generate text (as opposed to training it).
GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.
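To make a few of these concrete, here is a small numpy sketch. The 129280 × 7168 shape is the one quoted above; everything else (the toy sizes at the end, the random weights, reusing one matrix for both embedding and un-embedding) is a simplification of mine so that it runs instantly:

    import numpy as np

    vocab_size, hidden_dim = 129280, 7168          # the [129280, 7168] shape from the glossary

    n_weights = vocab_size * hidden_dim            # ~0.93 billion weights in one such matrix
    print(n_weights * 1 / 1e9, "GB in FP8")        # 1 byte per weight  -> ~0.93 GB
    print(n_weights * 2 / 1e9, "GB in FP16")       # 2 bytes per weight -> ~1.85 GB
    # Not having to store (or load) a second copy of a matrix this size is the "GB savings".

    # Embedding and un-embedding, with a tiny stand-in matrix so nothing big is allocated:
    toy_vocab, toy_hidden = 1000, 64
    embed_tokens = np.random.randn(toy_vocab, toy_hidden)   # the big lookup table: token id -> vector
    token_id = 42
    vector = embed_tokens[token_id]                # embedding: shape (64,)
    logits = embed_tokens @ vector                 # un-embedding: one score per token, shape (1000,)
    print(vector.shape, logits.shape)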
The best primer I've seen is Andrej Karpathy's first video in his "zero to hero" series. It's worth following along with your own practice.
https://karpathy.ai/zero-to-hero.html
A while ago I had caught up with the basics thanks to the legendary 3blue1brown and his playlist on Neural Networks:
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ...
and other helpers like Artem Kirsanov:
https://www.youtube.com/watch?v=SmZmBKc7Lrs
Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is take some intro courses, train and design some smaller models yourself, get a list of core papers and concepts from Claude/Chat/Gemini, and then, as you read something like this, look up any acronym you don't know (in this case, MTP = Multi-Token Prediction) and check whether you have the basis for understanding it. If not, read up on the precursors.
Unlike many disciplines, AI doesn't have a lot of simplified models that are both intuitive and accurate -- most of the simplified models available don't describe what's going on well enough to reason about it. So you just have to start reading!
> Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate.
I don't think it moves that fast.
I mean, there are very few fundamental differences between GPT-2 and gpt-oss-120b; it's mostly incremental improvements that don't change the full picture much (a variation on the attention architecture and masking, a different activation function, different positional encodings, and swapping the MLP layers for a sparse “mixture of experts”). At the end of the day, from Mistral to DeepSeek, by way of Llama and Qwen3, it's always the same stack of transformer layers with slight variations between any two architectures.
This Qwen3-Next is special, though, as it's the first time a major player has released something this different (lesser players have made hybrid-architecture LLMs for the past two years, but when it comes to language models, IBM really isn't comparable to Alibaba). This is what I expected Llama 4 to be.
Background:
LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to 1D at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary. i.e. f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params). I see you have a gamedev background, so as an example: Bézier curves are parameterized functions where the Bézier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to Bézier curves in this regard).
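To make that picture concrete, here is a minimal numpy sketch of the "parameterized function" view. Everything in it (the sizes, the one-hot input, a single matrix for each stage) is a toy stand-in of my own, not how any real LLM is built, but f() really is just downscale(upscale(x)) with adjustable parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, hidden = 50, 16                      # made-up sizes; real models use ~100k words and thousands of dims

    W_up = rng.normal(size=(hidden, vocab))     # parameters of upscale()
    W_down = rng.normal(size=(vocab, hidden))   # parameters of downscale()

    def f(token_id):
        x = np.zeros(vocab)
        x[token_id] = 1.0                       # the input word as a one-hot vector
        h = W_up @ x                            # upscale(): the word becomes a vector of hidden numbers
        logits = W_down @ h                     # downscale(): back to one score per word in the vocabulary
        p = np.exp(logits - logits.max())
        return p / p.sum()                      # probabilities, one for each word in the vocabulary

    print(f(7).shape)                           # (50,); training adjusts W_up and W_down so the
                                                # probability of the word that actually came next goes up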
---
When training, you make an LLM learn that
I use arch = downscale(upscale(I use))
If you want to predict the next word after that, you then do the following, in sequence:
I use arch btw = downscale(upscale(I use arch))
Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.
i.e., in parallel:
I use arch = downscale1(upscale(I use))
I use ____ btw = downscale2(upscale(I use))
However, this way you'll need twice the number of parameters that downscale has. And if you want to predict more tokens ahead, you'll need even more parameters.
What Qwen has done is this: instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially a bet that a lot of the logic is common, and that the difference between predicting the next and the next-to-next token can be captured by one lightweight function each. Lightweight here means fewer parameters. The bet paid off.
So overall, you save params.
Concretely,
Before: downscale1.params + downscale2.params
After: downscale_common.params + lightweight1.params + lightweight2.params
Edit: it's actually downscale_common(lightweight(.)), not the other way around as I wrote above. This doesn't change the crux of the answer, but I'm including it for clarity.
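Here is a toy numpy sketch of that shared-trunk idea, using the corrected composition from the edit (downscale_common applied after a small per-position function). The names, sizes, and the single-matrix "upscale" standing in for the whole transformer stack are my own simplifications, not Qwen's actual architecture:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, hidden = 50, 16                         # toy sizes, nothing like the real model

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    W_up = rng.normal(size=(hidden, vocab))        # shared upscale() (stand-in for the transformer stack)
    W_down = rng.normal(size=(vocab, hidden))      # shared downscale_common(): hidden -> score per vocab word

    W_light1 = rng.normal(size=(hidden, hidden))   # small lightweight1(): specializes h for the next word
    W_light2 = rng.normal(size=(hidden, hidden))   # small lightweight2(): specializes h for the next-to-next word

    x = np.zeros(vocab)
    x[7] = 1.0                                     # pretend this one-hot encodes "I use"
    h = W_up @ x                                   # shared work, done once

    p_next = softmax(W_down @ (W_light1 @ h))      # downscale_common(lightweight1(h)) -> "arch"?
    p_next_next = softmax(W_down @ (W_light2 @ h)) # downscale_common(lightweight2(h)) -> "btw"?

    # The parameter-count bet described above:
    naive = 2 * W_down.size                        # downscale1.params + downscale2.params
    shared = W_down.size + W_light1.size + W_light2.size
    print(naive, shared)                           # 1600 vs 1312 here; the gap grows as vocab >> hidden

With realistic numbers (a vocab around 129k and a hidden size of a few thousand, as in the glossary above), each full downscale matrix is on the order of a billion weights, while each lightweight head is far smaller, which is where the savings come from.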
So after your edit it would be (just to clarify): downscale1(.) = downscale_common(lightweight1(.)) and downscale2(.) = downscale_common(lightweight2(.))?
And does it generate 2 at a time and keep going that way, or is there some overlap?
You generate blocks of 2 at a time, yes; in general, blocks of k. As you can imagine, larger k performs worse. LLM(I like cats) is very likely to continue with "because they", but beyond that there are too many possibilities. LLM(I like cats because they are) = small and cute and they meow, while LLM(I like cats because they eat) = all the rats in my garden.
If you try to predict the whole thing at once you might end up with
I like cats because they are all the rats and they garden
> Overlap
Check out an inference method called self-speculative decoding, which solves (somewhat) the above problem of k-token prediction and does overlap the same ___ across multiple computations.
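On the overlap question, the gist of that draft-then-verify trick is easiest to see as a toy loop. The "models" below are hand-made lookup tables of mine, and real implementations accept or reject drafts using probabilities rather than exact matches, but the structure (cheaply guess k tokens, verify them with the full model, keep the longest verified prefix) is the point:

    # Toy draft-then-verify loop, the gist of (self-)speculative decoding.

    def draft_model(ctx, k):
        # Cheap guesser that proposes k tokens at once (playing the role of the MTP heads).
        table = {"I like cats": ["because", "they", "are", "small"]}
        return table.get(" ".join(ctx), ["and"] * k)[:k]

    def full_model(ctx):
        # Trusted next-token predictor used to verify the guesses, one token at a time.
        table = {"cats": "because", "because": "they", "they": "eat", "eat": "all"}
        return table.get(ctx[-1], "the")

    def generate(ctx, k=4, steps=3):
        for _ in range(steps):
            guesses = draft_model(ctx, k)
            accepted = []
            for g in guesses:
                verified = full_model(ctx + accepted)
                if verified != g:          # first mismatch: keep the verified token and stop
                    accepted.append(verified)
                    break
                accepted.append(g)         # guess confirmed: this token came "for free"
            ctx = ctx + accepted
        return ctx

    print(" ".join(generate(["I", "like", "cats"])))
    # -> I like cats because they eat all the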
Ooooh, neat! That was very well explained, thank you.
> I see you have a gamedev background
Thanks for the tailored response! ^^
Really good
Dude, this was like that woosh of cool air on your brain when an axe splits your head in half. That really brought a lot of stuff into focus.
For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask it to explain things at my level, and then I can ask follow-up questions for clarification.
The glossary of terms at the top of this thread was generated by chatG5.