Comment by bob1029
7 months ago
I've found the best approach is to start with traditional full-text search. Get it to a point where manual human searches are useful, especially for users who don't have a stake in the development of an AI solution. Then look at building a RAG-style solution around the FTS.
I never could get much beyond the basic search piece. I don't see how mixing in a black box AI model with probabilistic outcomes could add any value without having this working first.
You're right, and you can still use LLMs and vector search in such a system, but instead you use them to enrich the queries made to traditional, pre-existing knowledge bases and search systems. Arguably you could call this "generative assisted retrieval", or GAR. Sadly I didn't coin the term, there's a paper about it ;-) https://aclanthology.org/2021.acl-long.316/
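Roughly what that looks like in practice, as a hedged sketch: the `call_llm` and `fts_search` helpers below are placeholders for whatever LLM client and search backend you already have, not anything from the paper.

```python
# Sketch of "generative assisted retrieval": the LLM rewrites/expands the
# user's query, and a traditional FTS index does the actual retrieval.
# `call_llm` and `fts_search` are placeholders.

def enrich_query(user_query: str, call_llm) -> list[str]:
    prompt = (
        "Rewrite this search query and add 3 alternative phrasings, "
        "one per line, using likely keywords from the domain:\n" + user_query
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def gar_search(user_query: str, call_llm, fts_search, k: int = 10):
    results = {}
    for q in [user_query] + enrich_query(user_query, call_llm):
        for doc_id, score in fts_search(q, limit=k):
            # keep the best score seen for each document across all query variants
            results[doc_id] = max(score, results.get(doc_id, float("-inf")))
    return sorted(results.items(), key=lambda x: x[1], reverse=True)[:k]
```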
Traditional FTS returns the whole document, and people take over from that point to locate the interesting content within it. The problem with RAG is that it does not follow that procedure: it tries to find the interesting chunk in one step, even though, since ReAct, we have known that LLMs can follow the same procedure as humans.
But we need an iterative RAG anyway: https://zzbbyy.substack.com/p/why-iterative-thinking-is-cruc...
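Something like this ReAct-ish loop is what I have in mind; `search`, `fetch_document`, and `ask_llm` are placeholders, not anything from the linked post:

```python
# Sketch of an iterative retrieval loop: search, read, and let the model
# decide whether it has enough context or needs to search again.

def iterative_rag(question: str, search, fetch_document, ask_llm, max_steps: int = 3):
    notes = []
    query = question
    for _ in range(max_steps):
        hits = search(query, limit=3)
        notes.extend(fetch_document(doc_id) for doc_id, _score in hits)
        decision = ask_llm(
            "Question: " + question
            + "\n\nNotes so far:\n" + "\n---\n".join(notes)
            + "\n\nIf you can answer, reply ANSWER: <answer>. "
              "Otherwise reply SEARCH: <a better query>."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision[len("SEARCH:"):].strip()
    # ran out of steps: answer with whatever was gathered
    return ask_llm("Question: " + question
                   + "\nAnswer as best you can from these notes:\n" + "\n".join(notes))
```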
For my application we do a land-and-expand strategy, where we use a mix of BM25 and semantic search to find a chunk, but before showing it to the LLM we then expand to include everything on that page.
It works pretty well. It might benefit from including some material on the page prior and after, but it mostly solves the "isolated chunk" problem.
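Roughly like this (the table and column names are made up; any store that maps a chunk back to its parent page works):

```python
import sqlite3

# Sketch of "land and expand": retrieval finds a chunk, but the LLM is shown
# the whole page the chunk came from. Assumes a `chunks` table recording
# which page each chunk belongs to, and a `pages` table with the full text.

def expand_to_page(db: sqlite3.Connection, chunk_id: int) -> str:
    (page_id,) = db.execute(
        "SELECT page_id FROM chunks WHERE id = ?", (chunk_id,)
    ).fetchone()
    (page_text,) = db.execute(
        "SELECT text FROM pages WHERE id = ?", (page_id,)
    ).fetchone()
    return page_text  # could also pull the pages before and after here

def build_context(db, retrieved_chunk_ids):
    # de-duplicate pages when several chunks land on the same page
    seen, pages = set(), []
    for cid in retrieved_chunk_ids:
        text = expand_to_page(db, cid)
        if text not in seen:
            seen.add(text)
            pages.append(text)
    return "\n\n".join(pages)
```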
I always wondered why a RAG index has to be a vector DB.
If the model understands text/code and can generate text/code it should be able to talk to OpenSearch no problem.
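As a sketch of that idea, the model can simply emit a query DSL body and you hand it to the normal search API; the `ask_llm` helper and the `docs` index here are made up:

```python
import json
from opensearchpy import OpenSearch  # assumes a running OpenSearch cluster

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def llm_search(question: str, ask_llm, index: str = "docs"):
    # The model writes the query DSL; traditional search does the retrieval.
    body = json.loads(ask_llm(
        "Write an OpenSearch query DSL JSON body (match/bool queries only) "
        "to find documents answering: " + question
    ))
    return client.search(index=index, body=body)["hits"]["hits"]
```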
It doesn't have to be a vector DB - and in fact I'm seeing increasing skepticism that embedding vector DBs are the best way to implement RAG.
A full-text search index using BM25 or similar may actually work a lot better for many RAG applications.
I wrote up some notes on building FTS-based RAG here: https://simonwillison.net/2024/Jun/21/search-based-rag/
I've been using SQLite FTS (which ranks results with BM25) and it works so well I haven't really bothered with vector databases, or Postgres, or anything else yet. Maybe when my corpus exceeds 2GB...
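For anyone curious, the whole setup is a few lines; SQLite's FTS5 ranks with BM25 out of the box (the table and query here are just illustrative):

```python
import sqlite3

db = sqlite3.connect("corpus.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, body)")
db.execute(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    ("Example", "Full text search with BM25 ranking in SQLite."),
)

# FTS5's default `rank` is BM25: lower values mean better matches,
# so ORDER BY rank returns the best hits first.
hits = db.execute(
    "SELECT title, body, rank FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT 5",
    ("bm25 ranking",),
).fetchall()
```

The top rows then go straight into the prompt as context.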
What are the arguments for embedding vector DBs being suboptimal in RAG, out of curiosity?
6 replies →
In 2019 I was using vector search to narrow the search space within 100s of millions of documents and then do full text search on the top 10k or so docs.
That seems like a better stacking of the technologies, even now.
2 replies →
You can view RAG as a bigger word2vec. The canonical example is "king - man + woman = queen". Words, or now chunks, have geometric distributions, clusters, and relationships on semantic levels.
What is happening is that text is being embedded into a different space, and the result is an array of floats (a point in the embedding space). When we do retrieval, we embed the query and then find other points close to that query. The reasons for a vector DB are (1) to optimize for this use case, the same way we have many specialized data stores and indexes (Redis, Elasticsearch, Dolt, RDBMSes) for other workloads, and (2) often to keep the index in memory for faster retrieval. pgvector will be interesting to watch. I personally use Qdrant.
Full-text search will never be able to do some of the things that are possible in the embedding space. The most capable systems will use both techniques.
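Stripped of the database, the core retrieval operation is just nearest-neighbour search over those points; a sketch, with `embed` standing in as a placeholder for whatever embedding model you use:

```python
import numpy as np

def top_k(query: str, chunks: list[str], embed, k: int = 5):
    # Normalise rows so that a dot product equals cosine similarity.
    matrix = np.stack([embed(c) for c in chunks]).astype(float)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scores = matrix @ q
    best = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in best]
```

A vector DB does the same thing behind an approximate-nearest-neighbour index plus persistence, which is what makes it worthwhile at scale.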
"When we do retrieval, we embed the query and then find other points close to that query."
To me that just sounds like OpenSearch with extra steps.
How is this different/better than a search engine?
Inner product similarity in an embedding space is often a very valuable feature in a ranker, and the effort/wow ratio at the prototype phase is good, but the idea that it’s the only pillar of an IR stack is SaaS marketing copy.
Vector DBs are cool, you want one handy (particularly for recommender tasks). I recommend FAISS as a solid baseline all these years later. If you’re on modern x86_64 then SVS is pretty shit hot.
A search engine that only uses a vector DB is a PoC.
For folks who want to go deeper on the topic, Lars basically invented the modern "news feed", which looks a lot like what a production RAG system would look like [1].
1. https://youtu.be/BuE3DIJGWOw
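To make the "one feature in a ranker" point concrete, here's a crude sketch; the weights and the recency feature are made-up placeholders you would tune or learn with a proper learning-to-rank model:

```python
# Hedged sketch: blend lexical and semantic signals instead of relying on
# embedding similarity alone.

def hybrid_score(bm25_score: float, cosine_sim: float, recency: float,
                 w_lex: float = 0.5, w_sem: float = 0.3, w_rec: float = 0.2) -> float:
    return w_lex * bm25_score + w_sem * cosine_sim + w_rec * recency

def rerank(candidates):
    # candidates: iterable of (doc_id, bm25_score, cosine_sim, recency)
    return sorted(candidates, key=lambda c: hybrid_score(*c[1:]), reverse=True)
```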
Honestly you clocked the secret: it doesn’t.
It makes sense for the hype, though. As we got LLMs we also got wayyyy better embedding models, but they're not dependent on each other.
But with FTS you don't solve the "out-of-context chunk" problem: you'll still miss relevant chunks with FTS. You can still apply the approach proposed in the post to FTS, but instead of using similarity you could use BM25.