Comment by siquick
7 months ago
I can’t imagine any serious RAG application is not doing this - adding a contextual title, summary, keywords, and questions to the metadata of each chunk is a pretty low effort/high return implementation.
> adding a contextual title, summary, keywords, and questions
That's interesting; do you then transform the question-as-prompt before embedding it at runtime, so that it "asks for" that metadata to be in the response? Because otherwise, it would seem to me that you're just making it harder for the prompt vector and the document vectors to match.
(I guess, if it's equally harder in all cases, then that might be fine. But if some of your documents have few tags or no title or something, they might be unfairly advantaged in a vector-distance-ranked search, because the formats of the documents more closely resemble the response format the question was expecting...)
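For concreteness, here is one way that runtime transform could look (a rough sketch, not something the parent comments describe): have a cheap model rewrite the question into a pseudo-document shaped like the enriched chunks, then embed the rewrite instead of the raw question. The model choice, prompt, and function name are placeholders.

```python
# Hedged sketch: rewrite a user question into a pseudo-document whose shape
# mirrors the metadata-enriched chunks (title, summary, keywords, question),
# then embed that rewrite for retrieval. Prompt and names are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_to_pseudo_doc(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": ("Rewrite the user's question as a short passage with a title, "
                         "a one-sentence summary, a keyword list, and the question itself, "
                         "so that it resembles an enriched document chunk.")},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Embed query_to_pseudo_doc(question) with the same embedding model used for the chunks.
```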
You can also train query awareness into the embedding model. This avoids LLMs rewriting questions poorly and lets you embed questions the way your customers actually ask them.
For an example with multimodal: https://www.marqo.ai/blog/generalized-contrastive-learning-f...
But the same approach works with text.
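A rough text-only sketch of that idea, using sentence-transformers' in-batch contrastive loss rather than the specific method in the linked post; the query/document pairs here are made up:

```python
# Hedged sketch: fine-tune an embedding model on (real user query, relevant chunk)
# pairs so queries embed close to the documents that answer them.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# Pairs of "how customers actually ask" and the chunk that answers them (illustrative).
train_examples = [
    InputExample(texts=["how do I reset my password?",
                        "To reset your password, open Settings > Security and choose Reset."]),
    InputExample(texts=["refund for item that arrived broken",
                        "Items damaged in transit are eligible for a full refund within 30 days."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# Other pairs in the batch act as negatives, pulling queries toward their own documents.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```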
Text embeddings don't capture inferred data: "second letter of this text" does not embed close to "e". LLM chain of thought is required to deduce that kind of meaning more completely.
Given current SOTA, no, they don’t.
But there’s no reason why they couldn’t — just capture the vectors of some of the earlier hidden layers during the RAG encoder’s inference run, and append these intermediate vectors to the final embedding vector of the output layer to become the vectors you throw into your vector DB. (And then do the same at runtime for embedding your query prompts.)
Probably you’d want to bias those internal-layer vectors, giving them an increasingly-high “artificial distance” coefficient for increasingly-early layers — so that a document closely matching in token space or word space or syntax-node space improves its retrieval rank a bit, but not nearly as much as if the document were a close match in concept space. (But maybe do something nonlinear instead of multiplication here — you might want near-identical token-wise or syntax-wise matches to show up despite different meanings, depending on your use-case.)
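A rough sketch of the capture-and-weight idea, assuming a Hugging Face encoder; the layer indices and weights are arbitrary placeholders, and the pooling is naive mean pooling:

```python
# Hedged sketch: capture hidden states from several layers of an encoder,
# down-weight the earlier layers, and concatenate everything into one
# retrieval vector for the vector DB. Use the same embed() at query time.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

LAYERS = [2, 4, -1]        # early, middle, and final hidden layers
WEIGHTS = [0.1, 0.3, 1.0]  # "artificial distance" bias: earlier layers count for less

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: one tensor per layer
    parts = []
    for layer, weight in zip(LAYERS, WEIGHTS):
        # Mean-pool tokens in this layer (ignores padding masks for simplicity),
        # L2-normalize, then scale by the layer weight.
        pooled = hidden_states[layer].mean(dim=1).squeeze(0)
        parts.append(weight * pooled / pooled.norm())
    return torch.cat(parts)  # one long vector combining token-, syntax-, and concept-level signal
```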
Come to think, you could probably build a pretty good source-code search RAG off of this approach.
(Also, it should hopefully be obvious here that if you fine-tuned an encoder-decoder LLM to label matches based on criteria where some of those criteria are only available in earlier layers, then you’d be training pass-through vector dimensions into the intermediate layers of the encoder — such that using such an encoder on its own for RAG embedding should produce the same effect as capturing + weighting the intermediate layers of a non-fine-tuned LLM.)
How do you generate keywords in a low-effort way for each chunk?
Asking an LLM is low effort, but it's neither efficient nor guaranteed to be correct.
If the economic case justifies it, you can use a cheap or lower-end model to generate the meta information. Considering how cheap gpt-4o-mini is, that seems pretty plausible.
At my startup we also got pretty good results using 7B/8B models to generate meta information about chunks/parts of text.
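For example, a minimal version of that enrichment step might look like this; gpt-4o-mini stands in for the cheap model, and the prompt and JSON schema are just one possible shape:

```python
# Hedged sketch: enrich each chunk with a title, summary, keywords, and likely
# questions using a cheap model, then store the result as chunk metadata.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def enrich_chunk(chunk_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": ("Return JSON with keys: title, summary, "
                         "keywords (list of strings), questions (list of strings) "
                         "describing the user's text.")},
            {"role": "user", "content": chunk_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

# The returned dict is attached to the chunk and embedded alongside the chunk text.
```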
I agree; most production RAG systems have been doing this since last year.