Comment by julienchastang
15 days ago
I've had limited but good experience (with both English and French text) with Tesseract, then getting ChatGPT to fix problems with clever prompting (e.g., pretend you are an expert OCR corrector, blah blah, blah).
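For anyone curious, here's a minimal sketch of that kind of pipeline, assuming pytesseract (with Pillow) and the openai Python client; the model name and prompt wording are just placeholders, not a specific recipe:

```python
# Minimal sketch: Tesseract OCR pass followed by an LLM cleanup pass.
# Assumes pytesseract + Pillow and the openai client; the model name is a placeholder.
from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ocr_and_correct(image_path: str, lang: str = "eng+fra") -> str:
    # Raw OCR over the page image (assumes English + French traineddata are installed).
    raw_text = pytesseract.image_to_string(Image.open(image_path), lang=lang)

    # "Expert OCR corrector" style prompt: fix recognition errors without rewriting content.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "You are an expert OCR corrector. Fix spelling, spacing, and "
                           "hyphenation errors in the OCR output. Do not add or remove content.",
            },
            {"role": "user", "content": raw_text},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content


print(ocr_and_correct("page_001.png"))
```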
For most text-dense documents without much layout variation, these small prompt-engineering tricks work pretty well! Scaling this to complex layouts and 1000+ page documents, we found the models don't stick to their instructions. Perhaps there's some work to be done with 1M+ context-length models so they don't lose layout memory.
Do any models use some sort of context pruning to keep only the most relevant parts of the context?
What single documents are you processing that are 1000+ pages?
Is processing one page at a time not feasible? I'm always chunking things as small as possible for LLMs.
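A minimal sketch of the page-at-a-time approach, assuming pypdf for extraction; `correct_page` is an illustrative stand-in for whatever per-page LLM call you use (e.g., the OCR-correction sketch above):

```python
# Minimal sketch: process a long PDF one page at a time instead of in one giant prompt.
# Assumes pypdf for extraction; correct_page() is a placeholder for any per-page LLM call.
from pypdf import PdfReader


def correct_page(text: str) -> str:
    # Placeholder for an LLM cleanup call (see the OCR-correction sketch above).
    return text


def process_pdf_page_by_page(pdf_path: str) -> list[str]:
    reader = PdfReader(pdf_path)
    corrected_pages = []
    for page in reader.pages:
        page_text = page.extract_text() or ""
        # Each page fits comfortably in the context window, so instructions are less
        # likely to be dropped than with a single 1000+ page prompt.
        corrected_pages.append(correct_page(page_text))
    return corrected_pages


pages = process_pdf_page_by_page("big_scan.pdf")
print(f"processed {len(pages)} pages")
```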