
Comment by ec109685

7 months ago

Would it be better to go all the way and completely rewrite the source material in a way more suitable for retrieval? To some extent these headers are a step in that direction, but you’re still at the mercy of the chunk of text being suitable to answer the question.

Instead, completely transforming the text into a dense set of denormalized “notes” that cover every concept present in the text seems like it would be easier to mine for answers to user questions.

Essentially, it would be like taking comprehensive notes from a book and handing them to a friend who didn’t take the class so they could use them on the test. What would those notes need to contain for the friend to be effective?
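A minimal sketch of what the "denormalized notes" idea could look like, assuming you rewrite every chunk into short standalone notes and embed the notes rather than the raw text. `call_llm`, `embed`, and the prompt wording here are placeholders, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class Note:
    text: str          # one self-contained statement covering a single concept
    source_chunk: str  # the original passage the note was derived from

NOTE_PROMPT = (
    "Rewrite the passage below as a list of short, standalone notes. "
    "Each note must be understandable without the passage and cover exactly "
    "one fact or concept.\n\nPassage:\n{chunk}"
)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in your embedding model here")

def build_note_index(chunks: list[str]) -> list[tuple[list[float], Note]]:
    """Turn every chunk into several notes and index the note embeddings."""
    index = []
    for chunk in chunks:
        raw = call_llm(NOTE_PROMPT.format(chunk=chunk))
        notes = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
        for text in notes:
            index.append((embed(text), Note(text=text, source_chunk=chunk)))
    return index
```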

Longer term, the sequence would likely be: get the question, hand it to a research assistant that has full access to the source material and can run a variety of AI / retrieval strategies to customize the notes, and then hand those notes back for answering. Spending more time on the note-gathering step makes it more likely the LLM will be able to answer the question.
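Roughly, that "research assistant" loop could look like the sketch below: the question drives retrieval over the full corpus, one LLM call condenses the hits into question-specific notes, and a second call answers from those notes only. `retrieve` and `call_llm` are assumed helpers standing in for whatever retrieval and model stack you use:

```python
def answer(question: str, retrieve, call_llm, k: int = 20) -> str:
    # Any retrieval strategy (or several) can back this call.
    passages = retrieve(question, k=k)
    notes = call_llm(
        "Take notes relevant to the question below from these passages.\n"
        f"Question: {question}\n\nPassages:\n" + "\n---\n".join(passages)
    )
    # Answer from the customized notes, not the raw passages.
    return call_llm(f"Using only these notes, answer: {question}\n\nNotes:\n{notes}")
```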

For a large corpus, this would be quite expensive in terms of time and storage space. My experience is that embeddings work pretty well at around 144-160 tokens per chunk (pure trial and error) with clinical trial protocols. I am certain this value will vary by domain and document type.
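One way to target that ~144-160 token range is to split on token count rather than characters. A small sketch using tiktoken's cl100k_base encoding as an example tokenizer; the target and overlap values are illustrative, and as noted above the right size is domain-dependent:

```python
import tiktoken

def chunk_by_tokens(text: str, target: int = 150, overlap: int = 20) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = start + target
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - overlap  # small overlap so sentences cut at a boundary survive
    return chunks
```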

If you generate notes and then "stuff" that extra text into the same chunk, my hunch is that accuracy drops off as the token count increases and the embedding becomes "muddy". GRAG or even normal RAG can mitigate this to an extent because, as you propose, you can generate a congruent "note", embed that separately, and link it back to the source.
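A sketch of that "embed the note, link it to the source" step, assuming the `(vector, Note)` index from the earlier sketch and numpy: similarity is scored against the short note vectors, but what gets returned is the full source chunk they point to.

```python
import numpy as np

def search_notes(query_vec, note_index, top_k: int = 5):
    q = np.asarray(query_vec)
    q = q / np.linalg.norm(q)
    scored = []
    for vec, note in note_index:
        v = np.asarray(vec)
        scored.append((float(q @ (v / np.linalg.norm(v))), note))  # cosine similarity
    scored.sort(key=lambda s: s[0], reverse=True)

    # Several notes may point at the same chunk, so deduplicate on the source.
    seen, results = set(), []
    for score, note in scored:
        if note.source_chunk not in seen:
            seen.add(note.source_chunk)
            results.append((score, note.source_chunk))
        if len(results) == top_k:
            break
    return results
```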

I'd propose something more flexible: expand the input query instead, multiplexing it into the related topics and ideas, and perform a cheap embedding search using more than one input vector.
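A minimal sketch of that query multiplexing, assuming an LLM expands the question into a few related sub-queries, each is embedded and searched separately, and the per-vector results are merged (reciprocal rank fusion is used here as one simple merging choice). `call_llm`, `embed`, and `search` are placeholders for your own stack:

```python
from collections import defaultdict

def multiplex_search(question: str, call_llm, embed, search,
                     n_expansions: int = 4, k: int = 10):
    # Expand the query into related topics/ideas, one per line.
    raw = call_llm(
        f"List {n_expansions} short search queries covering the topics and "
        f"ideas related to this question, one per line:\n{question}"
    )
    queries = [question] + [q.strip() for q in raw.splitlines() if q.strip()]

    # Cheap embedding search per query vector, fused with reciprocal rank fusion.
    fused = defaultdict(float)
    for q in queries:
        hits = search(embed(q), top_k=k)          # returns ranked document ids
        for rank, doc_id in enumerate(hits):
            fused[doc_id] += 1.0 / (60 + rank)    # RRF with the common constant 60
    return sorted(fused, key=fused.get, reverse=True)[:k]
```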