Comment by mingtianzhang

3 months ago

Hi Imanari! That’s essentially one of the key challenges we’re aiming to address with our PageIndex package.

We’ve designed two LLM functions:

a. LLM Function 1: init_content -> initial_structure

b. LLM Function 2: (previous_structure, current_content) -> current_structure

The idea is to split a long document into several page groups, each small enough to fit in the context window. You first apply Function 1 to the first group to get the initial structure, then apply Function 2 in a loop over the remaining page groups, each time passing in the structure built so far, to extend it step by step.
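Here is a minimal sketch of that loop in Python, just to illustrate the shape of it. It is not the actual PageIndex code: `build_structure`, `fake_init`, `fake_update`, and the dict-based `Structure` are all hypothetical names, and the two "LLM functions" are replaced by trivial stand-ins that take the first line of each page group as a section title.

```python
from typing import Callable, List

# Hypothetical representation: the structure is a plain dict holding an outline.
Structure = dict


def build_structure(
    page_groups: List[str],
    init_fn: Callable[[str], Structure],
    update_fn: Callable[[Structure, str], Structure],
) -> Structure:
    """Fold page groups into one structure: init on the first, update on the rest."""
    if not page_groups:
        raise ValueError("need at least one page group")
    # Function 1: init_content -> initial_structure
    structure = init_fn(page_groups[0])
    # Function 2: (previous_structure, current_content) -> current_structure
    for content in page_groups[1:]:
        structure = update_fn(structure, content)
    return structure


# Stand-ins for the two LLM calls (assumptions, for illustration only).
def fake_init(content: str) -> Structure:
    return {"sections": [content.splitlines()[0]]}


def fake_update(previous: Structure, content: str) -> Structure:
    return {"sections": previous["sections"] + [content.splitlines()[0]]}


groups = ["Intro\npage 1 text", "Methods\npage 2 text", "Results\npage 3 text"]
outline = build_structure(groups, fake_init, fake_update)
print(outline)  # {'sections': ['Intro', 'Methods', 'Results']}
```

In a real setup the two callables would wrap prompts to a model; keeping them as parameters means the loop itself stays the same regardless of which model or prompt is used.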

This kind of recurrent update, where a running state is combined with each new input, is commonly used in representation learning for time-series data. We'll be releasing a technical report on it soon as well.

Mingtian


Imanari 3 months ago

Thanks! I have thought about similar approaches that iteratively build the content graph of a document base, as you described. I worry about scaling, though. IIUC, both previous_structure and current_content must fit into the context window, while previous_structure gets bigger with each iteration, correct?
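To make the worry concrete, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `chunk_budgets`, the made-up token counts, and especially the idea that previous_structure grows by a fixed number of tokens per step.

```python
def chunk_budgets(context_tokens, prompt_tokens, growth_per_step, steps):
    """Tokens left for current_content at each step, assuming linear structure growth."""
    budgets = []
    for k in range(steps):
        # Assumed size of previous_structure before step k (zero at the first step).
        structure_tokens = k * growth_per_step
        budgets.append(max(0, context_tokens - prompt_tokens - structure_tokens))
    return budgets


# Hypothetical numbers: 8000-token window, 500-token prompt, +300 tokens per step.
print(chunk_budgets(context_tokens=8000, prompt_tokens=500, growth_per_step=300, steps=5))
# [7500, 7200, 6900, 6600, 6300]
```

Under these assumptions the room left for each new chunk shrinks steadily, and at some point it reaches zero unless the structure is compressed or only partly passed along.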

EDIT: follow-up question: how long does structure building take for 100 pages, and how big are the chunks you feed in?