Comment by simonw
7 months ago
It doesn't have to be a vector DB - and in fact I'm seeing increasing skepticism that embedding vector DBs are the best way to implement RAG.
A full-text search index using BM25 or similar may actually work a lot better for many RAG applications.
I wrote up some notes on building FTS-based RAG here: https://simonwillison.net/2024/Jun/21/search-based-rag/
I've been using SQLite FTS (which is essentially BM25) and it works so well I haven't really bothered with vector databases, or Postgres, or anything else yet. Maybe when my corpus exceeds 2GB...
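For anyone who wants to try this, here is a minimal sketch of BM25-ranked retrieval with SQLite FTS5 from Python. It is not the setup from the linked write-up: the `docs` table, its columns, and the `build_index`/`search` helpers are made-up names for illustration, and it assumes your Python's SQLite build was compiled with FTS5 (most are).

```python
import sqlite3

def build_index(docs):
    """Create an in-memory FTS5 index from (title, body) pairs (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", docs)
    return conn

def search(conn, query, limit=5):
    """Return (title, score) rows, best match first.

    FTS5's bm25() returns lower (more negative) values for better matches,
    so an ascending sort puts the most relevant rows at the top.
    """
    # Quote each term so punctuation in user input is not parsed as FTS5 syntax.
    terms = " OR ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    return conn.execute(
        "SELECT title, bm25(docs) AS score FROM docs "
        "WHERE docs MATCH ? ORDER BY score LIMIT ?",
        (terms, limit),
    ).fetchall()
```

Calling `search(conn, "some question")` returns the top rows to paste into the prompt. A query whose terms appear nowhere in the corpus returns an empty list rather than the least-bad documents.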
What are the arguments for embedded vector DBs being suboptimal in RAG, out of curiosity?
The biggest one is that it's hard to get "zero matches" from an embeddings database. You get back all results ordered by distance from the user's query, but it will really scrape the bottom of the barrel if there aren't any great matches - which can lead to bugs like this one: https://simonwillison.net/2024/Jun/6/accidental-prompt-injec...
The other problem is that embeddings search can miss things that a direct keyword match would have caught. If you have key terms that are specific to your corpus - product names for example - there's a risk that a vector match might not score those as highly as BM25 would have so you may miss the most relevant documents.
Finally, embeddings are much more black box and hard to debug and reason about. We have decades of experience tweaking and debugging and improving BM25-style FTS search - the whole field of "Information Retrieval". Throwing that all away in favour of weird new embedding vectors is suboptimal.
>but because embeddings search orders by similarity score it will ALWAYS return results, really scraping the bottom of the barrel if it has to
Why not have a similarity threshold? Say, if the similarity score is below 0.7, do not accept the search result.
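A rough sketch of what that threshold could look like, assuming cosine similarity over plain Python lists. The function names and the 0.7 default are illustrative only; a workable cutoff depends on the embedding model and would need tuning against real queries.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_with_threshold(query_vec, doc_vecs, min_similarity=0.7):
    """Return (doc_id, similarity) pairs at or above the cutoff, best first.

    doc_vecs maps a document id to its embedding (hypothetical layout).
    Unlike a plain nearest-neighbour query, this can return an empty list.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    kept = [(doc_id, sim) for doc_id, sim in scored if sim >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```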
In 2019 I was using vector search to narrow the search space within hundreds of millions of documents, then running full-text search on the top 10k or so docs.
That seems like a better stacking of the technologies even now.
Interesting. Why did you need to “narrow” the search space using vector space? Did you build custom embeddings and feel confident about retrieval segments?
I did something similar in 2019, but typically in reverse: FTS first, then a dual-tower model to rerank. Vector search was an additional capability but never augmented the FTS.
It came down to how slow our FTS was at the time over a large number of documents, and the window we wanted to keep response times within. And you're correct: we had custom embeddings and reasonably high confidence in them.
So vector search would reduce the space to like 10k documents and then we'd take the document ids and FTS acted as the final authority on the ranking.
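The shape of that pipeline, as a toy sketch rather than the system described above: stage one keeps the `k` documents nearest to the query embedding, and stage two ranks only those candidates by keyword match. The `keyword_score` function is a crude term-count stand-in for a real FTS engine, and every name here is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def narrow_by_vector(query_vec, doc_vecs, k):
    """Stage 1: ids of the k documents closest to the query embedding."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:k]

def keyword_score(query, text):
    """Toy stand-in for FTS scoring: total occurrences of the query terms."""
    words = text.lower().split()
    return sum(words.count(term) for term in set(query.lower().split()))

def two_stage_search(query, query_vec, doc_vecs, doc_texts, k=10_000, limit=10):
    """Vector search narrows the space; keyword ranking has the final say."""
    candidates = narrow_by_vector(query_vec, doc_vecs, k)
    scored = [(d, keyword_score(query, doc_texts[d])) for d in candidates]
    hits = [(d, s) for d, s in scored if s > 0]  # drop candidates with no term match
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:limit]
```

One consequence of this ordering is that a document with a strong keyword match never reaches the second stage if its embedding falls outside the top `k`.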