Comment by leonidasv
18 days ago
That 1M-token context window alone is going to kill a lot of RAG use cases. Crazy to see how we went from 4K-token context windows (GPT-3.5, 2023) to 1M in less than two years.
We heard this before when 100k and 200k context windows were first being normalized by Anthropic, and I tend to be skeptical of such predictions in general, but in this case I have to agree.
Having used the previews for the last few weeks with different tasks and personally designed challenges, what I found is that these models are not only capable of processing larger context windows on paper, but are also far better at actually handling long, dense, complex documents in full: referencing back to something upon specific request, doing extensive rewrites in full while handling previous context, etc. These models have also handled my private needle-in-a-haystack-type challenges without issues so far, though those have been limited to roughly 200k in fairness. Neither Anthropic's, OpenAI's, DeepSeek's nor previous Google models handled even 75k+ in any comparable manner.
Cost will of course remain a factor and will keep RAG a viable choice for a while, but for the first time I am tempted to agree that someone has delivered a solution showing that a larger context window can in many cases work reliably and far more seamlessly.
It is also the first time a Google model actually surprised me (positively); neither Bard, nor AI answers, nor any previous Gemini model had any appeal to me, even when testing specifically for what others claimed to be strengths (such as Gemini 1.5's alleged Flutter expertise, which was beaten by both OpenAI's and Anthropic's equivalents at the time).
That's not really my experience. The error rate goes up the more stuff you cram into the context, and processing gets both slower and more expensive as the number of input tokens grows.
I'd say it makes sense to do RAG even if your stuff fits into the context comfortably.
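As a back-of-the-envelope illustration of the cost point (the per-token price and token counts below are purely illustrative placeholders, not any provider's actual rates):

```python
# Illustrative per-query input cost: full context vs. retrieving a few chunks.
PRICE_PER_MILLION_INPUT_TOKENS = 1.25   # placeholder rate, not a real quote


def input_cost(tokens: int) -> float:
    """Cost of sending `tokens` input tokens at the illustrative rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


full_context = input_cost(800_000)       # the whole corpus in every prompt
rag = input_cost(5 * 1_000 + 2_000)      # 5 retrieved chunks + prompt overhead

print(f"full context: ${full_context:.4f}/query, RAG: ${rag:.4f}/query")
# -> roughly two orders of magnitude apart per query, before latency is considered.
```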
Try exp-1206. That thing works on large context.
> That 1M tokens context window
2M context window on Gemini 2.0 Pro: https://deepmind.google/technologies/gemini/pro/
> is going to kill a lot of RAG use cases.
I have a high level understanding of LLMs and am a generalist software engineer.
Can you elaborate on how exactly these insanely large (and now cheap) context windows will kill a lot of RAG use cases?
If a model has a 4K input context and you have a document or code base with 40K tokens, you have to split it up. The system prompt, user prompt, and output token budget all eat into this. You might need hundreds of small pieces, which typically end up in a vector database for RAG retrieval.
With a million tokens you can shove several short books into the prompt and just skip all that. That’s an entire small-ish codebase.
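To make the contrast concrete, here is a minimal sketch of the two paths. It is Python, with `embed()` and `generate()` as placeholders for whatever model API you use (not any specific vendor's SDK), and the chunk size and token heuristic are illustrative assumptions:

```python
import math

# Placeholders: wire these to your embedding / generation API of choice.
# def embed(text: str) -> list[float]: ...
# def generate(prompt: str) -> str: ...


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return len(text) // 4


def answer(question: str, document: str, context_budget: int = 1_000_000) -> str:
    prompt_overhead = 2_000  # system prompt, instructions, output budget

    if estimate_tokens(document) + prompt_overhead < context_budget:
        # Big context window: just send the whole document with the question.
        return generate(f"{document}\n\nQuestion: {question}")

    # Small context window: split into chunks, rank by similarity, keep the top few.
    chunk_size = 2_000  # characters per chunk, illustrative
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    question_vec = embed(question)
    scored = sorted(chunks, key=lambda c: cosine(embed(c), question_vec), reverse=True)
    top = "\n---\n".join(scored[:5])
    return generate(f"{top}\n\nQuestion: {question}")
```

With a 4K budget almost everything falls into the second branch (chunking, embedding, a vector store in practice); with a 1M budget most documents never leave the first.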
A colleague took an HTML dump of every config and config policy from a Windows network, pasted it into Gemini and started asking questions. It's just that easy now!
Maybe someone knows: what's the usual recommendation regarding big context windows? Is it safe to use them to the max, or will performance degrade, so we should adapt the maximum to our use case?
Gemini can in theory handle 10M tokens; I remember them saying so in one of their presentations.