Comment by elliotto

1 day ago

Ah I understand.

My startup provides a vector search system as part of its offering. A user can upload a dataset of records, build a vector index on one of its columns, and perform searches. It honestly works incredibly well across a whole bunch of different domains, and I was shocked at how useful it was out of the box compared to a conventional BM25-style keyword search. Since we got this working, it's completely changed the way I think about navigating unstructured text data.

If I have a dataset of 100k company website scrapes and I'm looking for gyms, a keyword search for 'gym' returns a whole bunch of conventional gyms, but it misses companies that describe themselves as fitness centers, aquatic centers, or MMA dojos. Vector search picks all of these up, though it usually ranks them slightly lower.

If I'm building a RAG bot that helps me look up companies and I search for a gym, I want the bot to have these extra companies in its context. I can apply a vector similarity cutoff, but I can also apply a #records cutoff, so that the bot always has the X most relevant records in its context window (see the sketch below).
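
Here's a minimal sketch of the two cutoff strategies, assuming the documents are already embedded into vectors. The names and threshold values are illustrative, not our actual API:

```python
# Sketch only: illustrates a similarity cutoff vs. a #records (top-k) cutoff.
import numpy as np

def search(query_vec: np.ndarray, doc_vecs: np.ndarray,
           min_similarity: float | None = None,
           top_k: int | None = None) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_similarity) pairs, most similar first."""
    # Cosine similarity between the query and every document vector.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    results = [(int(i), float(sims[i])) for i in np.argsort(-sims)]

    if min_similarity is not None:
        # Similarity cutoff: keep only records above a fixed threshold.
        results = [(i, s) for i, s in results if s >= min_similarity]
    if top_k is not None:
        # #records cutoff: always hand the bot the top_k most relevant records.
        results = results[:top_k]
    return results
```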

We've found the fuzziness of the vector search to be a problem in general-purpose search cases, because people write searches optimized for keyword match. We hit this with a company whose dataset contained highly technical product codes that the embedding search kept missing. Our solution for them was a hybrid keyword / vector search that prioritized keyword match but still considered vector similarity (roughly along the lines of the sketch below). But it's still a big issue to communicate to the user what to write in the embedding search box, whereas in RAG the bot handles all of this.
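
A rough sketch of that kind of hybrid scoring, assuming a BM25 keyword side and a cosine-similarity vector side; the weighting scheme here is illustrative, not our production logic:

```python
# Sketch only: boost keyword hits (e.g. exact product codes) above vector-only
# hits, while vector similarity still surfaces semantically related records.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, query_vec: np.ndarray,
                  docs: list[str], doc_vecs: np.ndarray,
                  keyword_weight: float = 2.0, top_k: int = 10):
    # Keyword side: BM25 over whitespace tokens, so product codes match exactly.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    kw_scores = np.asarray(bm25.get_scores(query.lower().split()))
    kw_scores = kw_scores / (kw_scores.max() + 1e-9)  # normalise to [0, 1]

    # Vector side: cosine similarity against the same documents.
    vec_scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)

    # Prioritise keyword match, but let vector similarity rank everything else.
    combined = keyword_weight * kw_scores + vec_scores
    order = np.argsort(-combined)[:top_k]
    return [(docs[i], float(combined[i])) for i in order]
```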

I think it's an unsolved problem, and there continues to be enormous development in this space.