Is the recall and reasoning equally good across the entirety of the 10M token window? Cause from what I've seen many of those window claims equate to more like a functional 1/10th or less context length.
It’s going to take a while to see how good this window is for real use; they’ve used a couple of new ideas to get to a 10M token context. Right now the only really good long-context model out there is Gemini Pro, and its effectiveness does start dropping somewhere around the 200k token range. I imagine insiders at GOOG have access to more than the published 1M token range there.
It will be fun to see what we get here, but I have no doubt the extra tokens will be useful - lots of use cases can do almost as well with summary-level accuracy memory.
I read somewhere that it was trained on 256k tokens and then expanded with RoPE on top of that, not starting from 16k like everyone else does, IIRC. So even if it isn't really flawless at 10M, I'd expect it to be much stronger than its competitors up to those 256k.
I very much agree. I've been using Gemini 2.5 Pro for coding, and I've always given it a simple instruction: never write comments. It will stop writing them for a time, but it falls apart well before the 1M context window is anywhere near full.
Now maybe this is more a lack of instruction following than a context-length issue, but the fact that it works at first and then starts going downhill quickly makes me wary about how much it will pay attention to other details further back in the context.
the needle in a haystack benchmark looks good but at this point I think we need new benchmarks to test actual understanding of content in such a large window.
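For context, the standard needle-in-a-haystack test only checks recall of a single planted fact, which is part of why it saturates so easily. Roughly, it looks like this (a minimal sketch; `client.generate` is a hypothetical stand-in for whatever API is under test):

    import random

    def build_haystack(filler_sentences, needle, total_sentences, depth):
        """Bury one 'needle' fact at a chosen relative depth inside filler text."""
        haystack = [random.choice(filler_sentences) for _ in range(total_sentences)]
        haystack.insert(int(depth * total_sentences), needle)
        return " ".join(haystack)

    def niah_trial(client, needle, question, expected, filler, n_sentences, depth):
        """Score one trial by exact containment of the expected answer."""
        prompt = build_haystack(filler, needle, n_sentences, depth)
        prompt += f"\n\nQuestion: {question}\nAnswer:"
        reply = client.generate(prompt)  # hypothetical API call
        return expected.lower() in reply.lower()

Passing this says almost nothing about whether the model can combine or reason over information spread across the window, which is the gap better benchmarks would need to cover.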
I think the problem is with positional encoding. If the model cannot clearly separate tokens in the context window, they overlap, which leads to a mess. It's the encoding that matters, not the actual position.
I assume they're getting these massive windows via RAG trickery, vectorization, and other tricks behind the curtain, because I've noticed the same as you: things start dipping in quality pretty quickly.
Does anyone know if I am correct in my assumption?
There's no "RAG trickery" or vector search. They changed the way they encode positions such that in theory they're less sensitive to where the token appears in the string.
That's similar to how previous long-context models worked as well, although the earlier iterations didn't work particularly well, as most have noticed; technically the model "worked" with longer contexts, but it would definitely get dumber. Still too early to tell how this newer variant works, although I'd assume it's at least somewhat better.
The large context windows generally involve RoPE [0], which is a trick that allows the training window to be smaller and then be extended at inference time. It seems like they have a new "iRoPE" variant, which might have better performance?
[0] https://arxiv.org/pdf/2104.09864
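For anyone curious what RoPE actually does mechanically: it rotates each pair of query/key features by an angle proportional to the token's position, so attention scores end up depending on relative offsets rather than absolute positions. A minimal NumPy sketch (simplified to a single vector; real implementations are batched and fused):

    import numpy as np

    def rope(x, position, base=10000.0):
        """Rotate feature pairs of a single query/key vector x (even length)
        by angles that grow with the token's position, as in the RoPE paper."""
        d = x.shape[-1]
        freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
        angles = position * freqs
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x, dtype=float)
        out[0::2] = x1 * cos - x2 * sin
        out[1::2] = x1 * sin + x2 * cos
        return out

The context-extension tricks (position interpolation, NTK-style scaling, etc.) mostly amount to rescaling the position or the frequencies before this rotation, which is how a model trained on a shorter window gets stretched to a longer one at inference.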
I don't think RAG will survive this time
4.8b words on English Wikipedia. Knowledge cutoff of 6 months. A valid use case is to search across Wikipedia and ground your answers. Trivially proves that RAG is still needed.
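The retrieval loop being defended here is pretty small, for comparison. A minimal sketch (assumes a hypothetical `embed` function and an in-memory index; production setups use a real vector store):

    import numpy as np

    def retrieve(query, chunks, chunk_embeddings, embed, k=5):
        """Return the k chunks most similar to the query by cosine similarity."""
        q = embed(query)  # hypothetical embedding function
        sims = chunk_embeddings @ q / (
            np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q)
        )
        return [chunks[i] for i in np.argsort(-sims)[:k]]

    def grounded_prompt(query, chunks, chunk_embeddings, embed):
        """Stuff only the retrieved passages into the prompt, not all of Wikipedia."""
        context = "\n\n".join(retrieve(query, chunks, chunk_embeddings, embed))
        return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}\nAnswer:"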
RAG still has lots of benefits for anyone paying per input token (e.g. over APIs).
Not to mention latency
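Rough numbers on the per-token point (the price below is made up for illustration, not any provider's actual rate; latency scales with prompt length too, since prefill has to touch every token):

    # Back-of-envelope: stuffing ~10M tokens into every prompt vs. retrieving
    # ~5k relevant tokens with RAG. The price is an assumption for illustration.
    PRICE_PER_M_INPUT_TOKENS = 0.50   # dollars per million input tokens (assumed)

    full_context_tokens = 10_000_000
    rag_context_tokens = 5_000

    cost_full = full_context_tokens / 1e6 * PRICE_PER_M_INPUT_TOKENS
    cost_rag = rag_context_tokens / 1e6 * PRICE_PER_M_INPUT_TOKENS

    print(f"full context: ${cost_full:.2f} per query")   # $5.00
    print(f"RAG:          ${cost_rag:.4f} per query")    # $0.0025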
This is only for the small model. The medium model is still at 1M (like Gemini 2.5)
Even if we could get the mid models to 10M, that's still a medium-sized repo at best. Repo sizes will also grow faster as LLMs generate more code. There's no way to catch up.
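For a sense of scale (assuming roughly 10 tokens per line of code, a common ballpark rather than a measured figure):

    # How many lines of code fit in a context window, at ~10 tokens per line
    # (an assumed ballpark, not a measured tokenizer statistic).
    TOKENS_PER_LINE = 10

    for window in (1_000_000, 10_000_000):
        print(f"{window:>10,} tokens ≈ {window // TOKENS_PER_LINE:,} lines of code")
    # 1M tokens  ≈ 100k LOC (a medium repo)
    # 10M tokens ≈ 1M LOC (still far short of a large monorepo)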
RAG scales up as everything else scales up. Flooding prompts with garbage is not a sound strategy...
How did they achieve such a long window and what are the memory requirements to utilize it?
According to [0] it's partly due to a key change they introduced: interleaving layers that use standard RoPE positional encodings with layers that use what's called NoPE [1], i.e. not encoding positions at all and letting the model figure them out on its own. (This only works because LLMs are autoregressive: the model can recognize an input token as being the very first one because there are no other tokens yet to attend to, and it can recursively derive the positions of subsequent tokens from that base case.)
[0] https://ai.meta.com/blog/llama-4-multimodal-intelligence/ [1] https://arxiv.org/abs/2305.19466
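On the memory half of the question, the KV cache is usually the binding constraint at these lengths. A rough estimate with a 16-bit cache and assumed model dimensions (illustrative numbers, not Llama 4's published serving config):

    # KV cache size: every layer stores one key and one value vector per token
    # per KV head. The dimensions below are illustrative assumptions.
    def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
        return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem  # 2x for K and V

    gib = kv_cache_bytes(tokens=10_000_000, layers=48, kv_heads=8, head_dim=128) / 2**30
    print(f"~{gib:,.0f} GiB of KV cache for a 10M-token prompt")  # roughly 1,831 GiB

So even with grouped-query attention keeping the per-token cost down, a full 10M-token prompt implies a cache on the order of terabytes spread across many accelerators, which is a big part of why serving these windows is expensive.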