Comment by mingtianzhang
11 days ago
Hi Imanari! That’s essentially one of the key challenges we’re aiming to address with our PageIndex package.
We’ve designed two LLM functions:
a. LLM Function 1: init_content -> initial_structure
b. LLM Function 2: (previous_structure, current_content) -> current_structure
The idea is to split a long document into several page groups, each small enough to fit within the context window. You first apply Function 1 to the first group to get the initial structure, then apply Function 2 in a for-loop over the remaining page groups, passing each step's output structure in as the next step's previous_structure, to build out the rest of the structure incrementally.
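To make the loop concrete, here is a minimal Python sketch of the idea. It is not the PageIndex API: `group_pages`, `build_structure`, `fake_init`, and `fake_update` are hypothetical names, the structure is simplified to a flat list of headings, character count stands in for token count, and the two "LLM functions" are replaced by trivial stand-ins so the example runs without a model.

```python
from typing import Callable, List

# Assumption: the structure is a flat list of section titles.
# A real structure would more likely be a tree.
Structure = List[str]


def group_pages(pages: List[str], max_chars: int) -> List[str]:
    """Greedily pack consecutive pages into groups of at most max_chars.

    Character count is a rough stand-in for a token budget. A single page
    longer than max_chars still becomes its own (oversized) group.
    """
    groups: List[str] = []
    current: List[str] = []
    size = 0
    for page in pages:
        if current and size + len(page) > max_chars:
            groups.append("\n".join(current))
            current, size = [], 0
        current.append(page)
        size += len(page)
    if current:
        groups.append("\n".join(current))
    return groups


def build_structure(
    page_groups: List[str],
    init_fn: Callable[[str], Structure],
    update_fn: Callable[[Structure, str], Structure],
) -> Structure:
    """Apply Function 1 to the first group, then Function 2 to the rest."""
    if not page_groups:
        return []
    structure = init_fn(page_groups[0])            # Function 1
    for group in page_groups[1:]:
        structure = update_fn(structure, group)    # Function 2
    return structure


# Stand-ins for the two LLM calls: they just collect "# " heading lines.
def fake_init(content: str) -> Structure:
    return [ln[2:] for ln in content.splitlines() if ln.startswith("# ")]


def fake_update(previous: Structure, content: str) -> Structure:
    return previous + fake_init(content)


pages = ["# Intro\ntext", "# Methods\nmore", "# Results\nend"]
groups = group_pages(pages, max_chars=20)
outline = build_structure(groups, fake_init, fake_update)
print(outline)
```

In a real setup, `init_fn` and `update_fn` would each wrap a model call, and `update_fn` would see both the structure so far and the new page group in a single prompt.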
This pattern, where a running state is updated with each new chunk of input, is commonly used in representation learning for time-series data. We'll be releasing a technical report on it soon as well.
Mingtian
Thanks! I've thought about similar approaches to iteratively building the content graph of a document base, as you described. I worry about scaling, though. IIUC, both previous_structure and current_content must fit into the context window, while previous_structure gets bigger with each iteration, correct?
EDIT: Follow-up question: how long does the structure-building take for 100 pages, and how big are the chunks you are feeding in?