I've found the best approach is to start with traditional full text search. Get it to a point where manual human searches are useful - Especially for users who don't have a stake in the development of an AI solution. Then, look at building a RAG-style solution around the FTS.
I never could get much beyond the basic search piece. I don't see how mixing in a black box AI model with probabilistic outcomes could add any value without having this working first.
You're right, and it's also possible to still use LLMs and vector search in such a system, but instead you use them to enrich the queries made to traditional, pre-existing knowledge bases and search systems. Arguably you could call this "generative assisted retrieval" or GAR.. sadly I didn't coin the term, there's a paper about it ;-) https://aclanthology.org/2021.acl-long.316/
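To make that concrete, here's a minimal sketch of the GAR idea (names are hypothetical; it assumes an OpenAI-style client and that `search_fts` is whatever pre-existing search endpoint you already have). The LLM only rewrites/expands the query; the traditional system does the actual retrieval.

    # Sketch: "generative assisted retrieval" - the LLM enriches the query,
    # a traditional FTS backend does the retrieval.
    from openai import OpenAI

    client = OpenAI()

    def expand_query(user_query: str) -> list[str]:
        # Ask the model for alternative phrasings / keywords, nothing else.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Rewrite this search query as 3 short keyword queries, "
                           "one per line, no commentary:\n" + user_query,
            }],
        )
        lines = resp.choices[0].message.content.splitlines()
        return [q.strip() for q in lines if q.strip()]

    def gar_search(user_query: str, search_fts) -> list[dict]:
        results = []
        for q in [user_query] + expand_query(user_query):
            results.extend(search_fts(q, limit=10))   # search_fts returns dicts with an "id"
        # Dedupe by document id, keeping the first (best-ranked) hit for each.
        seen = {}
        for r in results:
            seen.setdefault(r["id"], r)
        return list(seen.values())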
Traditional FTS returns the whole document - people take over from that point and locate the interesting content within it. The problem with RAG is that it does not follow that procedure - it tries to find the interesting chunk in one step, even though, since ReAct, we know that LLMs could follow the same procedure humans do.
But we need an iterative RAG anyway: https://zzbbyy.substack.com/p/why-iterative-thinking-is-cruc...
For my application we do a land-and-expand strategy, where we use a mix of BM25 and semantic search to find a chunk, but before showing it to the LLM we then expand to include everything on that page.
It works pretty well. It might benefit from including some material on the page prior and after, but it mostly solves the "isolated chunk" problem.
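For what it's worth, a rough sketch of the expand step (hypothetical names; it assumes chunks were stored with the page number they came from and that `hybrid_search` is your BM25+vector retriever):

    # Land-and-expand: retrieve a chunk with hybrid search, then hand the LLM
    # the full page that chunk came from rather than the isolated chunk.
    from collections import defaultdict

    def land_and_expand(query, hybrid_search, chunks_by_page, top_k=5):
        hits = hybrid_search(query, top_k=top_k)        # [(chunk, score), ...]
        pages = defaultdict(float)
        for chunk, score in hits:
            pages[chunk["page"]] = max(pages[chunk["page"]], score)
        # Expand each landed chunk to its whole page, best page first.
        context_blocks = []
        for page in sorted(pages, key=pages.get, reverse=True):
            page_text = "\n".join(c["text"] for c in chunks_by_page[page])
            context_blocks.append(f"[page {page}]\n{page_text}")
        return "\n\n".join(context_blocks)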
I always wondered why a RAG index has to be a vector DB.
If the model understands text/code and can generate text/code it should be able to talk to OpenSearch no problem.
It doesn't have to be a vector DB - and in fact I'm seeing increasing skepticism that embedding vector DBs are the best way to implement RAG.
A full-text search index using BM25 or similar may actually work a lot better for many RAG applications.
I wrote up some notes on building FTS-based RAG here: https://simonwillison.net/2024/Jun/21/search-based-rag/
You can view RAG as a bigger word2vec. The canonical example being "king - man + woman = queen". Words, or now chunks, have geometric distribution, cluster, and relationships... on semantic levels
What is happening is that text is being embedded into a different space, and the result is an array of floats (a point in the embedding space). When we do retrieval, we embed the query and then find other points close to that query. The reason for a vector DB is (1) to optimize for this use case - just as we have many specialized data stores / indexes (redis, elastic, dolt, RDBMS) - and (2) often to be memory based for faster retrieval. PgVector will be interesting to watch. I personally use Qdrant
Full-text search will never be able to do some of the things that are possible in the embedding space. The most capable systems will use both techniques
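To make the mechanics concrete, a toy sketch of that retrieval loop (using sentence-transformers and brute-force cosine similarity; a vector DB like Qdrant essentially replaces the numpy part with a proper index):

    # Embed documents and a query into the same space, retrieve the nearest points.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["the king rules the realm", "the queen rules the realm", "cats sleep a lot"]

    doc_vecs = model.encode(docs, normalize_embeddings=True)   # shape: (n_docs, dim)

    def retrieve(query: str, k: int = 2):
        q = model.encode([query], normalize_embeddings=True)[0]
        sims = doc_vecs @ q                  # cosine similarity (vectors are normalized)
        top = np.argsort(-sims)[:k]
        return [(docs[i], float(sims[i])) for i in top]

    print(retrieve("who rules the kingdom?"))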
Inner product similarity in an embedding space is often a very valuable feature in a ranker, and the effort/wow ratio at the prototype phase is good, but the idea that it’s the only pillar of an IR stack is SaaS marketing copy.
Vector DBs are cool, you want one handy (particularly for recommender tasks). I recommend FAISS as a solid baseline all these years later. If you’re on modern x86_64 then SVS is pretty shit hot.
A search engine that only uses a vector DB is a PoC.
For folks who want to go deeper on the topic, Lars basically invented the modern “news feed”, which looks a lot like a production RAG system would [1].
1. https://youtu.be/BuE3DIJGWOw
Honestly you clocked the secret: it doesn’t.
It makes sense for the hype, though. As we got LLMs we also got wayyyy better embedding models, but they're not dependencies.
But with FTS you don't solve the "out-of-context chunk problem". You'll still miss relevant chunks with FTS. You can still apply the approach proposed in the post to FTS, but instead of using vector similarity you could use BM25.
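As a sketch of that (using the rank_bm25 package; the chunk contents are made up): index the header-augmented chunk text with BM25 instead of embedding it.

    # Index header-augmented chunks with BM25 instead of embeddings.
    from rank_bm25 import BM25Okapi

    chunks = [
        {"header": "Acme Corp 10-K > Risk Factors", "text": "Supply chain disruption ..."},
        {"header": "Acme Corp 10-K > Revenue",      "text": "Revenue grew 12% ..."},
    ]
    augmented = [f'{c["header"]}\n{c["text"]}' for c in chunks]
    bm25 = BM25Okapi([doc.lower().split() for doc in augmented])

    def search(query: str, k: int = 3):
        scores = bm25.get_scores(query.lower().split())
        ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
        return [(chunks[i], scores[i]) for i in ranked]

    print(search("what are Acme's risk factors?"))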
I can’t imagine any serious RAG application is not doing this - adding a contextual title, summary, keywords, and questions to the metadata of each chunk is a pretty low effort/high return implementation.
> adding a contextual title, summary, keywords, and questions
That's interesting; do you then transform the question-as-prompt before embedding it at runtime, so that it "asks for" that metadata to be in the response? Because otherwise, it would seem to me that you're just making it harder for the prompt vector and the document vectors to match.
(I guess, if it's equally harder in all cases, then that might be fine. But if some of your documents have few tags or no title or something, they might be unfairly advantaged in a vector-distance-ranked search, because the formats of the documents more closely resemble the response format the question was expecting...)
You can also train query awareness into the embedding model. This avoids LLMs rewriting questions poorly and lets you embed questions the way your customers actually ask them.
For an example with multimodal: https://www.marqo.ai/blog/generalized-contrastive-learning-f...
But the same approach works with text.
Text embeddings don't capture inferred data: for example, "second letter of this text" does not embed close to "e". LLM chain of thought is required to deduce the meaning more completely.
Given current SOTA, no, they don’t.
But there’s no reason why they couldn’t — just capture the vectors of some of the earlier hidden layers during the RAG encoder’s inference run, and append these intermediate vectors to the final embedding vector of the output layer to become the vectors you throw into your vector DB. (And then do the same at runtime for embedding your query prompts.)
Probably you’d want to bias those internal-layer vectors, giving them an increasingly-high “artificial distance” coefficient for increasingly-early layers — so that a document closely matching in token space or word space or syntax-node space improves its retrieval rank a bit, but not nearly as much as if the document were a close match in concept space. (But maybe do something nonlinear instead of multiplication here — you might want near-identical token-wise or syntax-wise matches to show up despite different meanings, depending on your use-case.)
Come to think, you could probably build a pretty good source-code search RAG off of this approach.
(Also, it should hopefully be obvious here that if you fine-tuned an encoder-decoder LLM to label matches based on criteria where some of those criteria are only available in earlier layers, then you’d be training pass-through vector dimensions into the intermediate layers of the encoder — such that using such an encoder on its own for RAG embedding should produce the same effect as capturing + weighting the intermediate layers of a non-fine-tuned LLM.)
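If anyone wants to play with this, here's roughly what I mean (a sketch with a Hugging Face encoder; the layer picks and weights are arbitrary placeholders, not tuned values):

    # Concatenate mean-pooled hidden states from selected layers, with earlier
    # layers down-weighted so surface-level matches count for less than semantic ones.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "sentence-transformers/all-MiniLM-L6-v2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)

    LAYERS = [2, 4, 6]          # arbitrary: early, middle, final layer
    WEIGHTS = [0.2, 0.5, 1.0]   # earlier layers contribute less to the distance

    @torch.no_grad()
    def embed(text: str) -> torch.Tensor:
        enc = tok(text, return_tensors="pt", truncation=True)
        hidden = model(**enc).hidden_states            # tuple: (layer0 ... layerN)
        parts = []
        for layer, w in zip(LAYERS, WEIGHTS):
            pooled = hidden[layer].mean(dim=1).squeeze(0)   # mean over tokens
            parts.append(w * torch.nn.functional.normalize(pooled, dim=0))
        return torch.cat(parts)                        # store this in the vector DB

    print(embed("def parse_header(line): ...").shape)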
How do you generate keywords in a low effort way for each chunk?
Asking an LLM is low effort to do, but it's not efficient nor guaranteed to be correct.
If the economical case justifies it you can use a cheap or lower end model to generate the meta information. Considering how cheap gpt-4o-mini is, seems pretty plausible to do that.
At my startup we also got pretty good results using 7B/8B models to generate meta information about chunks/parts of text.
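Concretely, the enrichment step is just something like this (a sketch; the prompt and JSON fields are whatever turns out to be useful for your domain, and gpt-4o-mini is only one option):

    # Generate a title / summary / keywords for each chunk with a cheap model,
    # stored alongside the chunk and prepended to it at indexing time.
    import json
    from openai import OpenAI

    client = OpenAI()

    def enrich_chunk(chunk_text: str, doc_title: str) -> dict:
        prompt = (
            f"Document: {doc_title}\n\nChunk:\n{chunk_text}\n\n"
            "Return JSON with keys: title, summary (1 sentence), keywords (<=5)."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": prompt}],
        )
        meta = json.loads(resp.choices[0].message.content)
        meta["augmented_text"] = f'{doc_title} | {meta["title"]}\n{chunk_text}'
        return meta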
I agree, most production RAG systems have been doing this since last year
RAG feels hacky to me. We’re coming up with these pseudo-technical solutions to help but really they should be solved at the level of the model by researchers. Until this is solved natively, the attempts will be hacky duct-taped solutions.
I've described it this way to my colleagues:
RAG is a bit like having a pretty smart person take an open book test on a subject they are not an expert in. If your book has a good chapter layout and index, you can probably do an OK job of finding relevant information, quickly reading it, and trying to come up with an answer. But you're not going to be able to test for a deep understanding of the material. This person is going to struggle if each chapter/concept builds on the previous one, as you can't just look up something in Chapter 10 and be able to understand it without understanding Chapters 1-9.
Fine-tuning is a bit more like having someone go off and do a PhD and specialize in a specific area. They get a much deeper understanding of the problem space and can conceptualize at a different level.
What you said about RAG makes sense, but my understanding is that fine-tuning is actually not very good at getting deeper understanding out of LLMs. It's more useful for teaching general instructions like output format rather than teaching deep concepts like a new domain of science.
That's so vague I can't tell what you're suggesting. What specifically do you think needs solving at the model level? What should work differently?
There's probably a lack of capabilities on multiple fronts. RAG might have the right general idea, but currently the retrieval seems to be too separated from the model itself. I don't know how our brains do it, but retrieval looks to be more integrated there.
Models currently also have no way to update themselves with new info besides us putting data into their context window. They don’t learn after the initial training. It seems if they could, say, read documentation and internalize it, the need for RAG or even large context windows would decrease. Humans somehow are able to build understanding of extensive topics with what feels to be a much shorter context-window.
I think he is saying we should be making fine-tuning or otherwise similar model altering methods easier rather than messing with bolt-on solutions like RAG
Those are being worked on, and RAG is the duct-tape solution until they become available
What about fresh data like an extremely relevant news headline that was published 10 minutes ago? Private data that I don’t want stored offsite but am okay trusting to an enterprise no-log API? Providing realtime context to LLMs isn’t “hacky”; model intelligence and RAG can complement each other and make advancements in tandem
I don't think the parent's idea was to bake all information into the model, just that current RAG feels cumbersome to use (but then again, so do most things AI right now) and information access should be an intrinsic part of the model.
One of my favorite cases is sports chat. I'd expect ChatGPT to be able to talk about sports legends but not be able to talk about a game that happened last weekend. Copilot usually does a good job because it can look up the game on Bing and then summarize, but the other day I asked it "What happened last week in the NFL?" and it told me about a Buffalo Bills game from last year (did it know I was in the Bills geography?)
Some kind of incremental fine tuning is probably necessary to keep a model like ChatGPT up to date but I can't picture it happening each time something happens in the news.
Fwiw, I used to think this way too but LLMs are more RAG-like internally than we initially realised. Attention is all you need ~= RAG is a big attention mechanism. Models have the reversal curse, memorisation issues, etc. I personally think of LLMs as a kind of decomposed RAG. Check out DeepMind’s RETRO paper for an even closer integration.
I guess you can imagine an LLM that contains all information there is - but it would have to be at least as big as all that information, or it would have to hallucinate. Not to mention that it seems you would also require it to learn everything immediately. I don't see any realistic way to reach that goal.
To reach their potential LLMs need to know how to use external sources.
Update: After some more thinking - if you required it to know information about itself - then this would lead to some paradox - I am sure.
A CL (continual learning) agent is the next generation of AI.
When CL is properly implemented in an LLM agent format, most of these systems vanish.
The set of techniques for retrieval is immature, but it's important to note that just relying on model context or few-shot prompting has many drawbacks. Perhaps the most important is that retrieval as a task should not rely on generative outputs.
It's also subject to significantly more hallucination when the knowledge is baked into the model, vs being injected into the context at runtime.
The biggest problem with RAG is that the bottleneck for your product is now the retrieval (i.e., results are only as good as what your vector store sends to the LLM). This is a step backwards.
Source: built a few products using RAG+LLM.
As is typical with any RAG strategy/algorithm, the implicit assumption is that it works on a specific dataset and solves a very specific use case. The thing is, if you have a dataset and a use case, you can build a very custom algorithm that works wonders for the output you need. There need not be anything generic.
My instinct at this point is that these algos look attractive because we are constrained to giving a user a wow moment where they upload something and get to chat with the doc/dataset within minutes. As attractive as that is, it is a distinct second priority to building a system that works 99% of the time, even if it takes a day or two to set up. You get a feel for the data and for the type of questions that may be asked, and create an algo that works for a specific dataset-usecase combo (assuming any more data you add to this system would be similar and work pretty well). There is no silver bullet of the kind we seem to be searching for.
100% agree with you. I've built a number of RAG systems and find that simple Q&A-style use cases actually do fine with traditional chunking approaches.
... and then you have situations where people ask complex questions with multiple logical steps, or knowledge gathering requirements, and using some sort of hierarchical RAG strategy works better.
I think a lot of solutions (including this post) abstract to building knowledge graphs of some sort... But knowledge graphs still require an ontology associated with the problem you're solving and will fail outside of those domains.
The easiest solution to this is to stuff the heading into the chunk. The heading is hierarchical navigation within the sections of the document.
I found Azure Document Intelligence specifically with the Layout Model to be fantastic for this because it can identify headers. All the better if you write a parser for the output JSON to track depth and stuff multiple headers from the path into the chunk.
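The parser itself is not much code - roughly this shape (the element structure here is a simplified stand-in, not the exact Document Intelligence JSON schema; the point is just tracking the current header path while walking the layout output):

    # Walk layout elements in reading order, maintain a header path, and prepend
    # it to every content chunk.
    def chunk_with_headers(elements: list[dict]) -> list[dict]:
        header_path: list[str] = []   # one entry per heading level
        chunks = []
        for el in elements:
            if el["kind"] == "heading":
                level = el["level"]                       # 1 = top-level section
                header_path[:] = header_path[:level - 1] + [el["text"]]
            else:  # body text
                chunks.append({
                    "header": " > ".join(header_path),
                    "text": el["text"],
                    "augmented": " > ".join(header_path) + "\n" + el["text"],
                })
        return chunks

    sample = [
        {"kind": "heading", "level": 1, "text": "4. Safety"},
        {"kind": "heading", "level": 2, "text": "4.2 Adverse Events"},
        {"kind": "body", "text": "All adverse events must be reported within 24 hours..."},
    ]
    print(chunk_with_headers(sample)[0]["augmented"])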
So subtle! The article is about doing that, which is something we are doing a lot of work on right now... though it seems to snatch defeat from the jaws of victory:
If we think about what this is about, it is basically entity augmentation & lexical linking / citations.
Ex: A patient document may be all about patient id 123. That won't be spelled out in every paragraph, but by carrying along the patient ID (semantic entity) and the document (citation), the combined model gets access to them. A naive one-shot retrieval over a naive chunked vector index would want that information in the text/embedding itself, while a smarter one would also put it in the entry metadata. And as others write, this helps move reasoning from the symbolic domain to the semantic domain, so less of a hack.
We are working on some fun 'pure-vector' graph RAG work here to tackle production problems around scale, quality, & always-on scenarios like alerting - happy to chat!
Also working with GRAG (via Neo4j), and I'm somewhat skeptical that, for most cases where a natural hierarchical structure already exists, graph RAG will significantly exceed RAG that uses the hierarchical structure.
A better solution I had thought about is "local RAG". I came across this while processing embeddings from chunks parsed from Azure Document Intelligence JSON. The realization is that relevant topics are often localized within a document. Even across a corpus of documents, relevant passages are localized.
Because the chunks are processed sequentially, one needs only to keep track of the sequence number of the chunk. Assume the embedding matches chunk n; it then follows that the most important context is in the chunks localized from n - m through n + p. So find the top x chunks via hybrid embedding + full-text match and expand outwards from each of those chunks to grab the chunks around it.
While a chunk may represent just a few sentences of a larger block of text, this strategy will grab possibly the whole section or page of text localized around the chunk with the highest match.
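The expansion step is trivial if each chunk keeps its sequence number - something like this sketch (m and p are just tunable window sizes):

    # "Local RAG": expand each matched chunk to its neighbours by sequence number.
    def expand_local(hits, all_chunks, m=2, p=2):
        # hits: matched chunks; all_chunks: dict seq -> chunk, in document order
        keep = set()
        for h in hits:
            n = h["seq"]
            keep.update(range(n - m, n + p + 1))
        ordered = sorted(s for s in keep if s in all_chunks)
        return "\n".join(all_chunks[s]["text"] for s in ordered)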
Would it be better to go all the way and completely rewrite the source material in a way more suitable for retrieval? To some extent these headers are a step in that direction, but you’re still at the mercy of the chunk of text being suitable to answer the question.
Instead, completely transforming the text into a dense set of denormalized “notes” that cover every concept present in the text seems like it would be easier to mine for answers to user questions.
Essentially, it would be like taking comprehensive notes from a book and handing them to a friend who didn’t take the class for a test. What would they need to be effective?
Longer term, the sequence would likely be “get question”, hand it to a research assistant who has full access to source material and can run a variety of AI / retrieval strategies to customize the notes, and then hand those notes back for answers. By spending more time on the note-gathering step, it will be more likely the LLM will be able to answer the question.
For a large corpus, this would be quite expensive in terms of time and storage space. My experience is that embeddings work pretty well around 144-160 tokens (pure trial and error) with clinical trial protocols. I am certain that this value will be different by domain and document types.
If you generate and then "stuff" more text into this, my hunch is that accuracy drops off as the token count increases and it becomes "muddy". GRAG or even normal RAG can solve this to an extent because -- as you propose -- you can generate a congruent "note" and then embed that and link them together.
I'd propose something more flexible: expand on the input query instead, basically multiplexing it to related topics and ideas, and perform a cheap embedding search using more than one input vector.
> Contextual chunk headers
> The idea here is to add in higher-level context to the chunk by prepending a chunk header. This chunk header could be as simple as just the document title, or it could use a combination of document title, a concise document summary, and the full hierarchy of section and sub-section titles.
That is from the article. Is this different from your suggested approach?
No, but this is also not really a novel solution.
I'd like to see more evaluation data. There are hundreds of RAG strategies, most of which only work on specific types of queries.
Yeah exactly, existing benchmark datasets are underutilized (e.g. KILT, Natural Questions, etc.).
But it is only natural that different QA use cases require different strategies. I've built 3 production RAG systems / virtual assistants now, and 4 that didn't make it past PoC, and which advanced techniques work really depends on document type, text content and genre, use case, source knowledge-base structure, metadata to exploit, etc.
Current go-to is semantic similarity chunking (with overlap) + title or question generation, then a retriever with fusion over bi-encoder vector similarity + classic BM25, feeding a condensed-question-reformulating QA agent. If you don't get some decent results with that setup, there is no hope.
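For the fusion part, a simple option is reciprocal rank fusion over the two rankings - a sketch (the weights and the k constant are hand-tuned per dataset):

    # Reciprocal rank fusion of a vector-similarity ranking and a BM25 ranking.
    # Each input is an ordered list of chunk ids (best first).
    def rrf(vector_ranked: list[str], bm25_ranked: list[str], k: int = 60,
            w_vec: float = 1.0, w_bm25: float = 1.0) -> list[str]:
        scores: dict[str, float] = {}
        for weight, ranking in ((w_vec, vector_ranked), (w_bm25, bm25_ranked)):
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    print(rrf(["c3", "c1", "c7"], ["c1", "c9", "c3"]))   # c1 and c3 float to the top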
For every project we start the creation of a use-case eval set immediately, in parallel with the actual RAG agent, but sometimes the client doesn't think this is a priority. We convinced them all it's highly important though, because it is.
Having an evaluation set is doubly important in GenAI projects: a generative system will do unexpected things and an objective measure is needed. Your client will run into weird behaviour when testing and they will get hung up on a 1-in-100 undesirable generation.
How do you weight results between vector search and bm25? Do you fall back to bm25 when vector similarity is below a threshold, or maybe you tweak the weights by hand for each data set?
RAG is akin to “search engine”.
It’s such a broad term that it’s essentially useless. Nearly anyone doing anything interesting with LLMs is doing RAG.
The definition for RAG that works for me is that you perform some form of "retrieval" (could be full-text search, could be vector search, could be some combination of the two or even another technique like a regular expression search) and you then include the results of that retrieval in the context.
I think it's a useful term.
An interesting paper that was recently published that talks about a different approach: Human-like Episodic Memory for Infinite Context LLMs <https://arxiv.org/abs/2407.09450>
This wasn't focused on RAG, but there seems to be a lot of crossover to me. Using the LLM to make "episodes" is a similar problem to chunking, and letting the LLM decide the boundary might also yield good results.
I really want to see some evaluation benchmark comparisons between in-chunk augmentation approaches like this (and question, title, and header generation) and the hybrid retrieval approach where you match at multiple levels: first retrieve/filter on a higher-level summary, title or header, then match the related chunks.
The pure vector approach of in-chunk text augmentation is much simpler of course, but my hypothesis is that the resulting vector will cause too many false positives in retrieval.
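For clarity, the multi-level variant I mean is roughly this (doc_index and chunk_index are assumed pre-built indexes; the search interface is a placeholder):

    # Two-stage retrieval: match document summaries/titles first, then only
    # keep chunks belonging to the shortlisted documents.
    def two_level_retrieve(query, doc_index, chunk_index, n_docs=5, n_chunks=8):
        top_docs = doc_index.search(query, limit=n_docs)             # summary/title level
        doc_ids = {d["doc_id"] for d in top_docs}
        chunk_hits = chunk_index.search(query, limit=n_chunks * 4)   # chunk level
        filtered = [c for c in chunk_hits if c["doc_id"] in doc_ids]
        return filtered[:n_chunks]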
In my experience, retrieval precision is most commonly the problem with vector similarity, not recall. This method will indeed improve recall for out-of-context chunks, but for me recall has not been a problem very often.
“An Outside Context Problem was the sort of thing most civilisations encountered just once, and which they tended to encounter rather in the same way a sentence encountered a full stop.”
https://www.goodreads.com/quotes/9605621-an-outside-context-...
(Sorry, I just had to post this quote because it was the first thing that came to my mind when I saw the title, and I've been re-reading Banks lately.)
One quick way to improve results greatly is to generate questions from 2-3 chunks at a time and, in the lookup record for each of these questions, store the IDs of the other chunks; Qdrant allows for easy metadata addition. So just generate a synthetic question bank & then do vector search against that instead of hoping for the chunks to match up with user questions.
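Concretely, something like this (a Qdrant sketch; the embed function and the question generation itself are left as stand-ins, and ids are assumed to be ints or UUIDs):

    # Index synthetic questions, each carrying the IDs of the chunks that answer it.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(":memory:")
    client.create_collection("questions",
                             vectors_config=VectorParams(size=384, distance=Distance.COSINE))

    def index_question(qid, question_text, chunk_ids, embed):
        # embed() is assumed to return a plain list of floats of length 384.
        client.upsert("questions", points=[PointStruct(
            id=qid,
            vector=embed(question_text),
            payload={"question": question_text, "chunk_ids": chunk_ids},
        )])

    def retrieve_chunks(user_question, embed, k=5):
        hits = client.search("questions", query_vector=embed(user_question), limit=k)
        chunk_ids = []
        for h in hits:
            chunk_ids.extend(h.payload["chunk_ids"])
        return list(dict.fromkeys(chunk_ids))   # dedupe, keep order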
I have identified a pain point where my RAG systems are insufficiently answering what I already had covered by a long-tail DAG in production.
I've never seen so many epicycles in my life...
Have you considered this approach? Worked well for us: https://news.ycombinator.com/item?id=40998497
I experience worse IR performance when adding titles/headers to chunks. It really depends on the nature of the documents. The only successful RAG systems I see are ones specifically tuned to a single domain and document type. If your document collection is diverse in domains or formats, good luck.