
Comment by mingtianzhang

11 days ago

Hi Imanari! That’s essentially one of the key challenges we’re aiming to address with our PageIndex package.

We’ve designed two LLM functions:

a. LLM Function 1: init_content -> initial_structure

b. LLM Function 2: (previous_structure, current_content) -> current_structure

The idea is to split a long document into several page groups, each small enough to fit within the context window. You first apply Function 1 to the first group to get the initial structure, then apply Function 2 in a for-loop over the remaining page groups to iteratively build out the rest of the structure.
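For concreteness, here's a minimal Python sketch of that loop. The helper names (`split_into_page_groups`, `init_structure`, `extend_structure`) and the use of a plain dict for the structure are illustrative assumptions, not the actual PageIndex API; the two LLM functions are passed in as callables.

```python
from typing import Callable, List

def split_into_page_groups(pages: List[str], group_size: int) -> List[List[str]]:
    """Chunk pages into groups small enough to fit the context window."""
    return [pages[i:i + group_size] for i in range(0, len(pages), group_size)]

def build_structure(
    pages: List[str],
    group_size: int,
    init_structure: Callable[[List[str]], dict],          # LLM Function 1 (assumed signature)
    extend_structure: Callable[[dict, List[str]], dict],  # LLM Function 2 (assumed signature)
) -> dict:
    """Build a document structure one page group at a time."""
    groups = split_into_page_groups(pages, group_size)

    # Function 1: init_content -> initial_structure
    structure = init_structure(groups[0])

    # Function 2: (previous_structure, current_content) -> current_structure,
    # applied in a for-loop over the remaining page groups
    for group in groups[1:]:
        structure = extend_structure(structure, group)

    return structure
```

Passing the LLM functions in as callables keeps the loop agnostic to the underlying model and prompts.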

This approach is commonly used in representation learning for time-series data. We'll be releasing a technical report on it soon as well.

Mingtian

Comment by Imanari

Thanks! I have thought about similar approaches for iteratively building the content graph of a document base, as you described. I worry about scaling, though. IIUC, both previous_structure and current_content must fit into the context window, while previous_structure grows with each iteration, correct?

EDIT: follow-up question: how long does the structure-building take for 100 pages, and how big are the chunks you are feeding in?