Comment by SyneRyder
10 hours ago
Just for anyone else who hadn't seen the announcement yet, this Anthropic 1M context is now the same price as the previous 256K context - not the beta where Anthropic charged extra for the 1M window:
https://x.com/claudeai/status/2032509548297343196
As for retrieval, the post shows Opus 4.6 at 78.3% needle retrieval success in 1M window (compared with 91.9% in 256K), and Sonnet 4.6 at 65.1% needle retrieval in 1M (compared with 90.6% in 256K).
Aren't these numbers really bad? > 80% needle retrieval means every fifth memory is akin to a hallucination.
I don't think it quite means that - happy to be corrected on this, but I think it's more like what percentage it can still pay attention to. If you only remembered "cat sat mat", that's only 50% of the phrase "the cat sat on the mat", but you've still paid attention to enough of the right things to be able to fully understand and reconstruct the original. 100% would be akin to memorizing & being able to recite in order every single word that someone said during their conversation with you.
But even if I've misunderstood how attention works, the numbers are relative. GPT 5.4 at 1M only achieves 36% needle retrieval. Gemini 3.1 & GPT 5.4 are only getting 80% at even the 128K point, but I think people would still say those models are highly useful.
It seems to be the hit rate of a very straightforward (literal matching) retrieval. Just checked the benchmark description (https://huggingface.co/datasets/openai/mrcr), here it is:
"The task is as follows: The model is given a long, multi-turn, synthetically generated conversation between user and model where the user asks for a piece of writing about a topic, e.g. "write a poem about tapirs" or "write a blog post about rocks". Hidden in this conversation are 2, 4, or 8 identical asks, and the model is ultimately prompted to return the i-th instance of one of those asks. For example, "Return the 2nd poem about tapirs".
As a side note, steering away from the literal matching crushes performance already at 8k+ tokens: https://arxiv.org/pdf/2502.05167, although the models in this paper are quite old (gpt-4o ish). Would be interesting to run the same benchmark on the newer models
Also, there is strong evidence that aggregating over long context is much more challenging than the "needle extraction task": https://arxiv.org/pdf/2505.08140
All in all, in my opinion, "context rot" is far from being solved
now that's major news