Comment by CharlieDigital

7 months ago

For a large corpus, this would be quite expensive in terms of time and storage space. My experience is that embeddings work pretty well at around 144-160 tokens (pure trial and error) with clinical trial protocols. I'm certain that this value will differ by domain and document type.

If you generate a chunk and then "stuff" more text into it, my hunch is that retrieval accuracy drops off as the token count increases and the embedding becomes "muddy". GRAG or even normal RAG can solve this to an extent because -- as you propose -- you can generate a congruent "note", embed that, and link the two together.
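
For what it's worth, a rough sketch of that note-linking idea (the `embed` helper is a toy stand-in for whatever embedding model you actually use, and `summarize` is a placeholder for an LLM call that writes the ~150-token note):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: hash tokens into a small unit vector so the sketch runs;
    # swap in a real embedding model in practice.
    vec = np.zeros(256)
    for tok in text.lower().split():
        vec[hash(tok) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_note_index(chunks: list[str], summarize) -> list[dict]:
    # For each chunk, generate a short congruent "note", embed the note,
    # and keep a link back to the full chunk it came from.
    index = []
    for chunk in chunks:
        note = summarize(chunk)      # e.g. an LLM call producing a ~150-token note
        index.append({
            "note": note,
            "vector": embed(note),   # search happens against the note...
            "chunk": chunk,          # ...but a hit returns the full chunk
        })
    return index
```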

I'd propose something more flexible: expand on the input query instead, basically multiplexing it out to related topics and ideas, and then perform a cheap embedding search using more than one input vector.
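
Concretely, something along these lines (reusing the toy `embed` and the index shape from the sketch above; `expand` is a placeholder for whatever generates the related sub-queries, e.g. an LLM or a keyword/thesaurus step):

```python
def expand_query(query: str, expand) -> list[str]:
    # Multiplex one query into several related topics/ideas,
    # always keeping the original query as one of the vectors.
    return [query] + expand(query)

def multi_vector_search(query: str, index: list[dict], expand, top_k: int = 5) -> list[str]:
    # Embed each expanded query and search with every vector,
    # keeping each chunk's best similarity across all of them.
    scores: dict[int, float] = {}
    for q in expand_query(query, expand):
        qv = embed(q)
        for i, entry in enumerate(index):
            sim = float(qv @ entry["vector"])   # cosine similarity; vectors are unit-norm
            scores[i] = max(scores.get(i, -1.0), sim)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [index[i]["chunk"] for i in best]
```

The appeal is that the extra work happens at query time against the same cheap index, rather than re-embedding or re-chunking the corpus.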