Comment by panarky

17 days ago

It loads the entire PDF into context, but then it would be my job to chunk the output for RAG, and doing arbitrary fixed-size blocks, or breaking on sentences or paragraphs, is not ideal.

So I can ask Gemini to return chunks of variable size, where each chunk is one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.

Fixed-size chunks are holding back a bunch of RAG projects on my backlog. Will be extremely pleased if this semantic chunking solves the issue. Currently we're getting around 78-82% success on fixed-size chunked RAG, which is far too low. Users assume zero results on a RAG search equates to zero results in the source data.
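
A minimal sketch of what that prompt can look like, using the google-genai Python SDK (the model name, the JSON-array output format, and the exact instruction wording are my assumptions, not a tested recipe):

    import json

    from google import genai

    client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

    PROMPT = (
        "Split the following document into chunks for retrieval. "
        "Each chunk must be exactly one complete idea or concept; never split "
        "a logical semantic segment across chunks. "
        "Return a JSON array of strings, one per chunk.\n\n"
    )

    def semantic_chunks(document_text: str) -> list[str]:
        response = client.models.generate_content(
            model="gemini-2.5-flash",  # assumption: any long-context Gemini model
            contents=PROMPT + document_text,
            config={"response_mime_type": "application/json"},
        )
        return json.loads(response.text)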

  • FWIW, in case you haven't tried these or ruled them out already:

    - BM25 to eliminate the zero-results-in-source-data problem

    - Longer term, take a peek at Gwern's recent hierarchical embedding article. Got decent early returns even with fixed-size chunks.

    • Agree, BM25 honestly does an amazing job on its own sometimes, especially if the content is technical.

      We use it in combination with semantic search, but sometimes turn off the semantic part to see what happens, and are surprised by the robustness of the results (rough sketch of the hybrid below).

      It works less well for cross-language or less technical content, however. It's great for acronyms, company- or industry-specific terms, project names, people, technical phrases, and so on.
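
      A minimal sketch of that hybrid, using the rank_bm25 and sentence-transformers packages (the fusion weights, the score normalization, and the embedding model name are assumptions; the use_semantic flag is the "turn off the semantic part" toggle):

          import numpy as np
          from rank_bm25 import BM25Okapi
          from sentence_transformers import SentenceTransformer

          chunks = ["...chunk one...", "...chunk two..."]  # your chunked corpus

          bm25 = BM25Okapi([c.lower().split() for c in chunks])
          model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any embedder
          chunk_vecs = model.encode(chunks, normalize_embeddings=True)

          def hybrid_search(query: str, k: int = 5, use_semantic: bool = True):
              # Lexical half: exact-match strength for acronyms, project names, etc.
              scores = bm25.get_scores(query.lower().split())
              scores = scores / (scores.max() or 1.0)  # crude normalization
              if use_semantic:
                  qvec = model.encode([query], normalize_embeddings=True)[0]
                  scores = 0.5 * scores + 0.5 * (chunk_vecs @ qvec)  # assumed weights
              top = np.argsort(scores)[::-1][:k]
              return [(chunks[i], float(scores[i])) for i in top]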

  • Also consider methods that use reasoning to dispatch additional searches based on analysis of the returned data (rough sketch below)
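
    A rough sketch of that loop; llm() and search() are hypothetical stand-ins for any chat-completion call and any retriever, and the JSON verdict format plus the three-round cap are assumptions:

        import json

        def llm(prompt: str) -> str:
            """Hypothetical stand-in for any chat-completion call."""
            raise NotImplementedError

        def search(query: str) -> list[str]:
            """Hypothetical stand-in for the retriever (e.g. the hybrid above)."""
            raise NotImplementedError

        def agentic_search(question: str, max_rounds: int = 3) -> list[str]:
            results, query = [], question
            for _ in range(max_rounds):
                results.extend(search(query))
                # Let the model inspect what came back and decide on a follow-up.
                verdict = json.loads(llm(
                    "Question: " + question + "\n"
                    "Results so far: " + json.dumps(results) + "\n"
                    'Reply as JSON: {"done": bool, "next_query": str}'
                ))
                if verdict["done"]:
                    break
                query = verdict["next_query"]
            return results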

I wish we had a local model for semantic chunking. I've been wanting one for ages, but haven't had the time to build a dataset and fine-tune for that task =/.
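
One way to bootstrap the dataset would be to distill from a large model: label documents with something like the semantic_chunks() sketch above, then fine-tune the small local model on the pairs. A sketch of the data-generation half (the JSONL record format is an assumption):

    import json

    def build_chunking_dataset(documents: list[str], out_path: str) -> None:
        # Distillation: the large model produces the target chunking per document.
        with open(out_path, "w") as f:
            for doc in documents:
                chunks = semantic_chunks(doc)  # from the Gemini sketch above
                f.write(json.dumps({"text": doc, "chunks": chunks}) + "\n")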