Comment by PlatoIsADisease

11 hours ago

Interesting.

I guess RAG is faster? But I'm realizing I'm outdated now.

No, RAG is definitely preferable once your memory size grows above a few hundred lines of text (which you can just dump into the context for most current models), since you're no longer fighting context limits or the needle-in-a-haystack retrieval problems LLMs have over long contexts.

  • > once your memory size grows above a few hundred lines of text (which you can just dump into the context for most current models)

    A few hundred lines of text is nothing for current LLMs.

    You can dump the entire contents of The Great Gatsby into any of the frontier LLMs and it’s only around 70K tokens. This is less than 1/3 of common context window sizes. That’s even true for models I run locally on modest hardware now.

    The days of chunking everything into paragraphs or pages and building complex pipelines to store embeddings, search, and rerank are going away for many common use cases. Having the LLM use simpler tools like grep with an array of related search terms, then evaluating what comes back, is often faster and doesn’t require an elaborate pipeline built around specific context lengths (rough sketch below).
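    A rough sketch of what that can look like, with the memory directory, file layout, and search terms all made up for illustration; plain grep does the retrieval:

    ```python
    # Sketch of grep-style retrieval: run a few related search terms over
    # plain-text memory files and hand the matching lines (plus some
    # surrounding context) back to the model. Paths and terms are
    # illustrative assumptions, not from this thread.
    import subprocess

    MEMORY_DIR = "memory/"                                   # hypothetical notes directory
    SEARCH_TERMS = ["invoice", "billing", "payment terms"]   # hypothetical related terms

    def grep_memory(terms, context_lines=2):
        """Collect matching snippets for each term using plain grep."""
        snippets = []
        for term in terms:
            result = subprocess.run(
                ["grep", "-r", "-i", f"-C{context_lines}", term, MEMORY_DIR],
                capture_output=True, text=True,
            )
            if result.stdout:
                snippets.append(result.stdout)
        return "\n---\n".join(snippets)

    # The combined snippets are small enough to drop straight into the prompt,
    # so there is no chunking, vector store, or reranker to maintain.
    context = grep_memory(SEARCH_TERMS)
    ```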

    • Yes, but how good will the recall performance be? Just because your prompt fits into context doesn't mean that the model won't be overwhelmed by it.

      When I last tried this with some Gemini models, they couldn't reliably identify specific scenes in a 50K-word novel unless I trimmed the context down to a few thousand words.

      > Having LLMs use simpler tools like grep based on an array of similar search terms and then evaluating what comes up is faster in many cases

      Sure, but then you're dependent on either you or the model picking the right phrases to search for. With embeddings, you get much better search performance.
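      A minimal sketch of that difference, assuming a small local embedding model (sentence-transformers here) and made-up notes: a semantic query can surface a note that shares no keywords with it, which is exactly where grep falls over.

      ```python
      # Sketch of embedding-based retrieval over small "memory" notes.
      # The model name and the notes are illustrative assumptions.
      import numpy as np
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works

      notes = [
          "Customer agreed to net 30 on the last call.",
          "The staging database was migrated on Tuesday.",
          "Renewal discussion pushed to next quarter.",
      ]
      note_vecs = model.encode(notes, normalize_embeddings=True)

      def search(query, k=2):
          """Return the k notes most similar to the query by cosine similarity."""
          q = model.encode([query], normalize_embeddings=True)[0]
          scores = note_vecs @ q   # dot product equals cosine similarity on unit vectors
          top = np.argsort(scores)[::-1][:k]
          return [notes[i] for i in top]

      # "payment terms" never appears in the notes, but the net-30 note ranks first.
      print(search("what payment terms did we agree to?"))
      ```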

I think it still has a place if your agent is part of a bigger application that you're running and you want to quickly get something into your model's context for a quick turnaround.
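For that case you can just stuff the whole memory into the prompt with no retrieval step at all. A minimal sketch, assuming the OpenAI Python client and a hypothetical memory file:

```python
# Sketch of the "small memory, quick turnaround" case: no RAG, no grep,
# just prepend the memory to the prompt. File name and model are assumptions.
from openai import OpenAI

client = OpenAI()
memory = open("agent_memory.txt").read()   # a few hundred lines at most

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Relevant memory:\n{memory}"},
        {"role": "user", "content": "Summarise what we decided last week."},
    ],
)
print(response.choices[0].message.content)
```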