Comment by mvieira38

2 days ago

To anyone working in these types of applications, are embeddings still worth it compared to agentic search for text? If I have a directory of text files, for example, is it better to save all of their embeddings in a VDB and use that, or are LLMs now good enough that I can just let them use ripgrep or something to search for themselves?

If your LLM is good enough you'll likely get better results from tool calling with grep or an FTS engine - the better models can even adapt their search patterns to query for things like "dog OR canine", where previously vector similarity may have been the bigger win.
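To make that concrete, here's a toy sketch of the kind of search tool an agent might call. The `grep_tool` name and the sample docs are invented for illustration; a real setup would shell out to grep/ripgrep or an FTS engine over files on disk:

```python
import re

def grep_tool(pattern: str, docs: dict[str, str]) -> list[str]:
    # the function the model invokes via tool calling; returns
    # the names of documents whose text matches the pattern
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name, text in docs.items() if rx.search(text)]

docs = {
    "a.txt": "My canine friend loves long walks.",
    "b.txt": "Quarterly revenue was up 4%.",
}

# the model broadens the query itself instead of relying on
# vector similarity to bridge "dog" and "canine"
matches = grep_tool(r"dog|canine", docs)
```

The point is that the query rewriting ("dog" → "dog|canine") happens inside the model, so a dumb exact-match tool gets some of the recall that embeddings used to provide.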

Getting embeddings working takes a bunch of work: you need to decide on a chunking strategy, then run the embeddings, then decide how best to store them for fast retrieval. You often end up having to keep your embedding store in memory which can add up for larger volumes of data.
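For a sense of those moving parts, here's a toy end-to-end sketch: fixed-size chunking plus cosine similarity over an in-memory store. The 3-dimensional vectors are fake stand-ins for a real embedding model's output (typically hundreds to thousands of dimensions), and the chunk sizes are arbitrary:

```python
import math

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # one simple chunking strategy: fixed-size character windows with overlap
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: list[float], b: list[float]) -> float:
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    origin = [0.0] * len(a)
    return dot / (math.dist(a, origin) * math.dist(b, origin))

# an in-memory "vector store": chunk text -> pretend embedding
store = {
    "dogs are loyal pets": [0.9, 0.1, 0.0],
    "stock prices fell today": [0.0, 0.2, 0.9],
}

query_vec = [0.8, 0.2, 0.1]  # pretend: embed("canine companions")
best = max(store, key=lambda c: cosine(store[c], query_vec))
```

Even in this toy form you can see where the decisions pile up: chunk size and overlap, which model produces the vectors, and whether a dict in memory is good enough or you need a real ANN index.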

I did a whole lot of work with embeddings last year, but I've mostly lost interest now that tool-based search has become so powerful.

Hooking up a tool-based search that itself uses embeddings is worth exploring, but you may find that the results you get from ripgrep are good enough that the considerable extra effort isn't worth it.

It depends on your use case and scale.

If you have a million records of unstructured text (very common: website scrapes of product descriptions, user reviews, etc.), you want to be doing an embedding search on these to get the most relevant docs.

If you have a hundred .py files, then you want your agent to navigate through these with a grep tool.

With the caveat that I have not spent a serious amount of time trying to get RAG to work: my brief attempt to use it (via AWS knowledge base) to compare it against agentic search resulted in me sticking with agentic search (via the Claude Code SDK).

My impression was there’s lots of knobs you can tune with RAG and it’s just more complex in general - so maybe there’s a point where the amount of text I have is large enough that that complexity pays off - but right now agentic search works very well and is significantly simpler to get started with

Semantic search is still important. I'd say that regex search is also quickly rising in importance, especially for coding agents.

Curious, but how do we take care of non-text files? What if we had a lot of PDF files?

  • Use pymupdf to extract the PDF text. Hell, run that nasty business through an LLM as step 2 to get a beautiful clean markdown version of the text. Lord knows the PDF format is horribly complex!

  • There are plenty of vision-capable embedding models; you might not need to OCR at all, and doing so could either improve or hurt performance.

  • We OCR them with an LLM into markdown. Super expensive and slow but way more reliable than trying to decode insanely structured PDFs that users upload, which often include pages that are images of the text, or diagrams and figures that need to be read.

    Really depends on your scale and speed requirements.

  • You can extract text from PDF files. (There are a number of dedicated models for that, but even the humble pdftotext can do it.)
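For the simple extract-first path mentioned above, here's a hedged sketch using pymupdf (a third-party package, `pip install pymupdf`, imported as `fitz`). The import is deferred inside the function so the sketch stays self-contained; the function name is my own:

```python
def pdf_to_text(path: str) -> str:
    """Extract plain text from a PDF, page by page."""
    # pymupdf is a third-party dependency; deferred import so
    # merely defining this helper doesn't require it installed
    import fitz

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            parts.append(page.get_text())
    return "\n".join(parts)
```

The output is where "that nasty business" starts: layout artifacts, hyphenation, and header/footer noise, which is why people feed it to an LLM as a cleanup pass, and why scanned-image pages need OCR instead.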