Comment by Weves
3 years ago
Agree with the point about intelligent chunking being very important! Each individual app connector can choose how it wants to split each `document` into `section`s (important point: this is customized at the app level). The default chunker then keeps each section within a single chunk as much as possible. The goal here is, as you mentioned, to give each chunk the relevant surrounding context.
Additionally, the indexing process is set up as a composable pipeline under the hood. It would be fairly trivial to plug in different chunkers for different sources as needed in the future.
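To make the idea concrete, here is a minimal sketch of a section-aware default chunker (this is not Danswer's actual implementation; the `Section`, `Document`, and `chunk_document` names and the character budget are hypothetical). The key property is that whole sections are packed greedily into chunks, and a section is only split when it alone exceeds the chunk budget:

```python
from dataclasses import dataclass


@dataclass
class Section:
    text: str


@dataclass
class Document:
    sections: list  # list[Section], as produced by an app connector


def chunk_document(doc: Document, max_chunk_chars: int = 2000) -> list:
    """Greedily pack whole sections into chunks; hard-split a section
    only if it is larger than the chunk budget by itself."""
    chunks = []
    current = ""
    for section in doc.sections:
        text = section.text
        if len(current) + len(text) <= max_chunk_chars:
            # Section fits alongside what we have: keep it intact.
            current += text
        else:
            if current:
                chunks.append(current)
            # Oversized section: fall back to a hard character split.
            while len(text) > max_chunk_chars:
                chunks.append(text[:max_chunk_chars])
                text = text[max_chunk_chars:]
            current = text
    if current:
        chunks.append(current)
    return chunks
```

Because the per-source splitting logic lives in the connector and the packing logic lives in the chunker, swapping in a different chunker for a given source is just a matter of changing one stage of the pipeline.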
Chunking is very important, but I feel it is best contextualised as one aspect of a bigger substantive challenge: preventing false negatives at the context-retrieval stage, i.e. ensuring your (vector? hybrid?) search returns all relevant context to the LLM's context window.
Would you mind saying a few words on how Danswer approaches this?