
Comment by m00dy

11 hours ago

RAG is broken when you have too much data.

Specifically, once the number of documents reaches around 10k+, a phenomenon called "Semantic Collapse" occurs.

https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...

Gemini with Google search is RAG using all public data, and it isn't broken.

  • It's not tool use with natural language search queries? That's what I'd expect.

    • It's RAG via tool use, where the storage and retrieval method is an implementation detail.

      I'm not a huge fan of the term RAG, though, because if you squint, almost all tool use could be considered RAG.

      But if you stick with RAG being a form of "knowledge search" then I think Google search easily fits.

    • It is tool use with natural language search queries, but one layer down those queries are run against a vector DB, which is very similar to RAG. Essentially, Google RankBrain is a distant ancestor of RAG, from before today's compute and scaling.
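
To make the "implementation detail" point in the replies above concrete, here is a minimal sketch, not a description of how Gemini or Google Search actually works. The names `SearchBackend`, `make_keyword_backend`, and `build_prompt` are hypothetical, and the word-overlap ranking is a stand-in that could be swapped for a vector DB without the caller noticing.

```python
from typing import Callable, List

# A search backend takes a query and a result count and returns passages.
# The caller never sees whether it is keyword search, a vector DB, or a web index.
SearchBackend = Callable[[str, int], List[str]]


def make_keyword_backend(corpus: List[str]) -> SearchBackend:
    """Hypothetical backend: ranks documents by word overlap with the query."""
    def search(query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
        matches = [(score, doc) for score, doc in scored if score > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in matches[:k]]
    return search


def build_prompt(question: str, search: SearchBackend, k: int = 2) -> str:
    """The 'RAG' step: call the search tool, then prepend its results."""
    context = "\n".join(f"- {passage}" for passage in search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


corpus = [
    "RAG retrieves documents before generation",
    "Bananas are yellow",
    "Vector search ranks documents by similarity",
]
search = make_keyword_backend(corpus)
prompt = build_prompt("how does search rank documents", search)
```

Any function with the same signature can replace `make_keyword_backend(corpus)` here, which is roughly why "knowledge search via a tool" and "RAG" end up looking like the same thing from the model's side.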

Can't you make the thresholds higher?

Hmm... I guess not, you might want all that data.
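
For anyone following along, here is a rough sketch of the trade-off, assuming "threshold" means a minimum cosine similarity between the query embedding and each document embedding. The vectors are toy 2-D values made up for illustration, and `retrieve` is a hypothetical helper, not any particular library's API.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query_vec: List[float],
    docs: Dict[str, List[float]],
    threshold: float,
) -> List[Tuple[float, str]]:
    """Keep only documents whose similarity to the query meets the threshold."""
    scored = [(cosine(query_vec, vec), name) for name, vec in docs.items()]
    return sorted([hit for hit in scored if hit[0] >= threshold], reverse=True)


# Toy embeddings (assumed values, not from a real model).
docs = {
    "exact_match": [1.0, 0.0],
    "related": [1.0, 1.0],
    "unrelated": [0.0, 1.0],
}
query = [1.0, 0.0]

loose = retrieve(query, docs, threshold=0.5)   # keeps exact_match and related
strict = retrieve(query, docs, threshold=0.9)  # keeps only exact_match
```

Raising the threshold cuts noise, but it also drops the "related" document, which is the "you might want all that data" problem.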

Super interesting topic. Learning a lot.