Comment by laborcontract
3 months ago
General overview below, as the pages don't seem to be working well
Llama 4 Models:
- Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
- They are natively multimodal: text + image input, text-only output.
- Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
- Knowledge cutoff: August 2024.
Llama 4 Scout:
- 17B active parameters, 16 experts, 109B total.
- Fits on a single H100 GPU (INT4-quantized).
- 10M token context window.
- Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
- Employs iRoPE architecture for efficient long-context attention.
- Tested with up to 8 images per prompt.
Llama 4 Maverick:
- 17B active parameters, 128 experts, 400B total.
- 1M token context window.
- Not single-GPU; runs on one H100 DGX host or can be distributed for greater efficiency.
- Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
- Maintains strong image understanding and grounded reasoning ability.
Llama 4 Behemoth (Preview):
- 288B active parameters, 16 experts, nearly 2T total.
- Still in training; not yet released.
- Exceeds GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks (e.g., MATH-500, GPQA Diamond).
- Serves as the “teacher” model for Scout and Maverick via co-distillation.
Misc:
- MoE Architecture: Only 17B parameters activated per token, reducing inference cost.
- Native Multimodality: Unified text + vision encoder, pre-trained on large-scale unlabeled data.
For a super ignorant person:
> Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each
Are those experts LLMs trained on specific tasks, or what?
This was an idea that sounded somewhat silly until it was shown to work. The idea is that you encourage, through training, a bunch of "experts" to diversify and "get good" at different things. These experts are, say, 1/10 to 1/100 of your model size if it were a dense model.
You pack them all up into one model, and you add a layer or a few layers whose job is to pick which small expert is best for your given token input and route it there. Voila: you've turned a full run through the dense parameters into a quick run through a router followed by a run through a little model 1/10 as long.
How do you get a "picker" that's good? Well, it's differentiable, and all we have in ML is a hammer, so you just do gradient descent on the router while training the experts!
This generally works well, although there are lots and lots of caveats. But it is (mostly) a free lunch, or at least a discounted lunch. I haven’t seen a ton of analysis on what different experts end up doing, but I believe it’s widely agreed that they tend to specialize. Those specializations (especially if you have a small number of experts) may be pretty esoteric / dense in their own right.
Anthropic’s interpretability team would be the ones to give a really high quality look, but I don’t think any of Anthropic’s current models are MoE.
Anecdotally, I feel MoE models sometimes exhibit slightly less "deep" thinking, but I might just be biased towards more weights. And they are undeniably faster, and better per second of clock time, GPU time, memory, and bandwidth usage, than dense models with similar training regimes.
The only thing about this which may be unintuitive from the name is that an "expert" is not something like a sub-LLM that's good at math and gets called when you ask a math question. Models like this have layers of networks they run tokens through, and each layer is composed of a pool of sub-networks (128 of them in Maverick's case), any of which can be selected (or several selected and merged in some way) at each layer independently.
So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just done a bit deeper in the model than one might assume.
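To make that concrete, here is a minimal sketch of a single MoE feed-forward layer in PyTorch. The dimensions, the top-1 routing, and all the names are illustrative assumptions, not Llama 4's actual implementation; real models add shared experts, top-k > 1 routing, load-balancing losses, capacity limits, and so on.

```python
# Minimal sketch of one MoE feed-forward layer (hypothetical sizes, not Llama 4's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=1):
        super().__init__()
        self.top_k = top_k
        # The "router" is just a linear layer that scores every expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        # Each "expert" is a small feed-forward sub-network, not a separate LLM.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # (n_tokens, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)       # pick top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    # Route the matching tokens through this expert and merge weighted results.
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# The router's scores go through a differentiable softmax, so plain gradient descent
# trains the router and the experts together, as described above.
layer = MoELayer()
print(layer(torch.randn(10, 512)).shape)                    # torch.Size([10, 512])
```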
16 replies →
The idea has also been around for at least 15 years; "ensemble learning" was a topic in my "Data Mining" textbook from around then.
Meta calls these individually smaller/weaker models "experts" but I've also heard them referred to as "bozos", because each is not particularly good at anything and it's only together that they are useful. Also bozos has better alliteration with boosting and bagging, two terms that are commonly used in ensemble learning.
1 reply →
If I have 5000 documents about A, and 5000 documents about B, do we know whether it's better to train one large model on all 10,000 documents, or to train 2 different specialist models and then combine them as you describe?
2 replies →
> Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking
It makes sense to compare apples with apples, i.e. the same amount of compute, right? Otherwise you're giving the MoE model less compute and then feeling like it underperforms. That shouldn't be surprising...
> These experts are say 1/10 to 1/100 of your model size if it were a dense model
Just to be precise: each layer (attention + fully connected) has its own router and experts, and there are usually 30+ layers. It can't be 1/10 per expert when there are literally hundreds of them.
Cool. Does that mean I could just run the query through the router and then load only the required experts? That is, could I feasibly run this on my MacBook?
I've been calling for this approach for a while. It's kinda similar to how the human brain has areas that are good at specific tasks
2 replies →
yes, and it's on a per-layer basis, I think!
So if the model has 16 transformer layers to go through on a forward pass, and at each layer it gets to pick between 16 different experts, that's 16^16 (roughly 1.8 × 10^19) possible expert combinations!
So this is kind of an ensemble sort of thing in ML like random forest and GBT?
The "Experts" in MoE is less like a panel of doctors and more like having different brain regions with interlinked yet specialized functions.
The models get trained largely the same way as non-MoE models, except with specific parts of the model siloed apart past a certain layer. The shared part of the model, prior to the splitting, is the "router". The router's routing behavior is itself learned during training, so it's basically a black box in terms of whatever internal structure emerges from this.
No, it's more like sharding of parameters. There's no understandable distinction between the experts.
I understand they're only optimizing for load distribution, but have people been trying to disentangle what the various experts learn?
2 replies →
I believe Mixture-of-Experts is a way for a neural network to group certain knowledge into smaller subsets. AFAIK there isn't a specific grouping goal; the network just figures out what goes where on its own, and then when an inference request is made it determines which "expert" would have that knowledge and routes it there. This makes the inference process much more efficient.
https://arxiv.org/abs/1701.06538
> Llama 4 Scout, Maximum context length: 10M tokens.
This is a nice development.
Are the recall and reasoning equally good across the entirety of the 10M token window? Because from what I've seen, many of those window claims equate to more like a functional 1/10th or less of the stated context length.
It's going to take a while to see how good this window is for real use; they've used a couple of new ideas to get to a 10M token context. Right now the only really good long-context model out there is Gemini Pro, and its effectiveness does start dropping, maybe in the 200k token range. I imagine insiders at GOOG have access to more than the published 1M token range there.
It will be fun to see what we get here, but I have no doubt the extra tokens will be useful - lots of use cases can do almost as well with summary-level accuracy memory.
I read somewhere that it was trained on 256k-token contexts and then extended with RoPE scaling on top of that, rather than starting from 16k like everyone else does, IIRC. So even if it isn't really flawless at 10M, I'd expect it to be much stronger than its competitors up to those 256k.
I very much agree. I've been using Gemini 2.5 Pro for coding and I've always given it one simple instruction: never write comments. It will stop writing them for a time, but it starts again long before the 1M context window is used up.
Now maybe this is more a lack of instruction following than of context length, but the fact that it works at first and then starts going downhill quickly makes me wary about how much it will pay attention to other details further back in the context.
The needle-in-a-haystack benchmark looks good, but at this point I think we need new benchmarks to test actual understanding of content in such a large window.
I think the problem is with positional encoding. If the model cannot clearly separate tokens in the context window, they overlap, which leads to a mess. It's the encoding that matters, not the actual position.
I assume they're getting these massive windows via RAG trickery, vectorization, and other tricks behind the curtain, because I've noticed the same as you: things start dipping in quality pretty quickly.
Does anyone know if I am correct in my assumption?
2 replies →
I don't think RAG will survive this time
4.8b words on English Wikipedia. Knowledge cutoff of 6 months. A valid use case is to search across Wikipedia and ground your answers. Trivially proves that RAG is still needed.
RAG still has lots of benefits for anyone paying per input token (e.g. over APIs).
2 replies →
This is only for the small model. The medium model is still at 1M (like Gemini 2.5)
Even if we could get the mid-size models to 10M, that's still a medium-sized repo at best. Repo size growth will also accelerate as LLMs generate more code. There's no way to catch up.
RAG gets bigger as everyone else gets bigger. Flooding prompts with garbage is not a sound strategy...
How did they achieve such a long window and what are the memory requirements to utilize it?
According to [0] it's partly due to a key change they introduced: interleaving layers that use standard RoPE positional encodings with layers using what's called NoPE [1], which doesn't encode positions at all and lets the model figure them out on its own. (This only works because the LLMs are autoregressive, so the model can recognize an input token as being the very first by there not yet being any other tokens to attend to, and can recursively derive the positions of subsequent tokens from that base case.)
[0] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
[1] https://arxiv.org/abs/2305.19466
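A rough sketch of that interleaving idea (not Meta's code; the "every 4th layer is NoPE" period and the helper names are assumptions for illustration):

```python
# Most attention layers apply rotary position embeddings (RoPE) to queries and keys,
# while every Nth layer skips positional encoding entirely (NoPE) and relies on the
# causal structure alone. The period of 4 is an assumed example.
import torch

def rope(x, base=10000.0):
    # x: (seq_len, n_heads, head_dim); rotate channel pairs by position-dependent angles.
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def add_positions(q, k, layer_idx, nope_every=4):
    if (layer_idx + 1) % nope_every == 0:
        return q, k              # NoPE layer: positions left entirely implicit
    return rope(q), rope(k)      # ordinary RoPE layer
```

In the NoPE layers, "earlier vs. later" is only implied by which tokens are visible under the causal mask, which is exactly the recursive argument in the parent comment.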
> Knowledge cutoff: August 2024.
Could this mean training time is generally around 6 months, with 2 months of QA?
I wish my knowledge cutoff was August 2024.
This made me LOL louder than I have for a long time! Agree.
Couldn’t you gradually include more recent documents as you train?
You can do that but the amount of incremental data will be negligible compared to the rest of the data. Think of the knowledge cutoff more like a soft value.
That makes it harder to analyze the results of training and draw conclusions for the next round.
It scales depending on the dataset you want exposure to and the compute you have available, so any specific time box is kind of meaningless if you don't know the rest of the inputs that went into it. The Llama 3 paper went into a lot of this and how these decisions were made (see section 3 and onward): https://ai.meta.com/research/publications/the-llama-3-herd-o...
tl;dr: llama 3 was 54 days, but it’s more complicated than that.
Thanks for sharing this here. At first I loved the simple Apache-style directory listing, a very classic and utilitarian way to navigate new information. Then I tried clicking the FAQ and it wouldn't load anything until I allowed two different sources of JavaScript.
I have a gut feeling the next step will be 2 or more levels of MoE, further reducing the memory bandwidth and compute requirements: a top-level MoE router decides which sub-MoE to route to.
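For what it's worth, a purely speculative sketch of what such two-level routing might look like; nothing like this is confirmed for any released model, and all names and sizes here are invented:

```python
# Speculative two-level routing sketch: a top-level router picks a sub-MoE ("group"),
# and that group's own router picks an expert inside it.
import torch
import torch.nn as nn

class TwoLevelRouter(nn.Module):
    def __init__(self, d_model=512, n_groups=4, experts_per_group=16):
        super().__init__()
        self.group_router = nn.Linear(d_model, n_groups)       # top level: pick a sub-MoE
        self.expert_routers = nn.ModuleList(
            [nn.Linear(d_model, experts_per_group) for _ in range(n_groups)]
        )

    def forward(self, x):                                       # x: (n_tokens, d_model)
        group = self.group_router(x).argmax(dim=-1)             # (n_tokens,)
        expert = torch.stack([
            self.expert_routers[g](t).argmax(-1) for g, t in zip(group.tolist(), x)
        ])
        return group, expert    # which sub-MoE, and which expert inside it, per token

# A trainable version would use soft (softmax) weights rather than argmax,
# just like an ordinary single-level MoE router.
```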
The solution to all problems in computer science is add a new level of indirection (or abstraction).
Except when the solution is to collapse abstraction in the name of efficiency.
17B puts it beyond the reach of a 4090 ... anybody do 4 bit quant on it yet?
Oh, it'll never run on a 4090. 17B is the active parameter count, not the total parameter count (and "active" doesn't mean you can slice just those params out and put them on the GPU; which parameters are active changes constantly, even per token. "Active" just means you get tokens faster than from a dense model). It's 109B total parameters, so even at 4-bit quantization you'd need at least 54.5GB of VRAM for the weights alone.
A Framework Desktop, Mac Studio, or Nvidia DGX Spark should be able to handle the Scout model locally though... Maybe even at FP8, depending on how much context you need.
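The arithmetic behind those numbers is simple (weights only; the KV cache and activations come on top):

```python
# Back-of-the-envelope weight memory for Scout's 109B total parameters.
total_params = 109e9
for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("INT4", 0.5)]:
    print(f"{name}: {total_params * bytes_per_param / 1e9:.1f} GB")
# BF16: 218.0 GB, FP8: 109.0 GB, INT4: 54.5 GB
```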
Well, Scout should run on the rumored 96GB 4090, since it runs on a single 80GB H100. But, yeah, it'd have to be at sub-2bit quantization to run on a standard 24GB.
Sounds runnable on 2x5090 presumably for $4k if back in stock.
1 reply →
You can swap experts in and out of VRAM, it just increases inference time substantially.
Depending on the routing function you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
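A hedged sketch of that order of operations: run the (tiny) router first, then copy only the selected experts from CPU RAM to VRAM before using them. The function and variable names are made up, and real pipelining would overlap the copies with compute via CUDA streams; this only shows the decide-then-load idea.

```python
# Keep expert weights in CPU RAM, run the router on GPU, load only what is needed.
import torch

def moe_layer_offloaded(x, router, cpu_experts, device="cuda"):
    # x and router live on the GPU; cpu_experts is a list of nn.Modules kept in CPU RAM.
    expert_ids = router(x).argmax(dim=-1)           # decide which experts we need, up front
    out = torch.zeros_like(x)
    for e in expert_ids.unique().tolist():
        expert = cpu_experts[e].to(device, non_blocking=True)   # load just this expert
        mask = expert_ids == e
        out[mask] = expert(x[mask])
        cpu_experts[e].to("cpu")                    # evict it again to free VRAM
    return out
```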
1 reply →
Unless something’s changed you will need the whole model on the HPU anyway, no? So way beyond a 4090 regardless.
You can still offload most of the model to RAM and use the GPU for compute, but it's obviously much slower than it would be if everything were in GPU memory.
see ktransformers: https://www.reddit.com/r/LocalLLaMA/comments/1jpi0n9/ktransf...
2 replies →
A habana just for inference? Are you sure?
Also, I see the 4-bit quants put it at an H100, which is fine... I've got those at work. Maybe there will be distilled versions for running at home.
If their knowledge cutoff is 8 months ago, then how on earth does Grok know things that happened yesterday?
I would really love to know that.
RAG?
At that scale? Is that even possible?
Nice release. I see that everyone is playing the differentiation game now: https://medium.com/thoughts-on-machine-learning/llama-4-and-...