Comment by HarHarVeryFunny
6 days ago
Is it really necessary to split it into pages? Not so bad if you automate it I suppose, but aren't there models that will accept a large PDF directly (I know Sonnet has a 32MB limit)?
They are limited in how much they can output, and there is generally an inverse relationship between the number of tokens you send and output quality after the first 20-30 thousand tokens.
Are there papers on this effect? I mean that response quality diminishes with very large inputs. I've observed the same.
I think these models all "cheat" to some extent with their long context lengths.
The original transformer had dense attention, where every token attends to every other token, so the computational cost grew quadratically with context length. There are other attention patterns that can be used, though, such as only attending to recent tokens (sliding-window attention), having a few global tokens that attend to all the others, attending to random tokens, or using combinations of these (e.g. Google's "Big Bird" attention from their ELMo/BERT muppet era).
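As a rough illustration, here's a minimal NumPy sketch of the two patterns expressed as boolean masks (real implementations fuse this into the attention kernel rather than materialising masks):

```python
import numpy as np

def dense_mask(n):
    # Causal dense attention: token i attends to every token j <= i,
    # so the number of attended pairs grows quadratically with n.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Sliding-window attention: token i attends only to the last
    # `window` tokens (itself included), so cost grows linearly with n.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

n = 8
print(dense_mask(n).sum())              # 36 attended pairs (n*(n+1)/2)
print(sliding_window_mask(n, 3).sum())  # 21 pairs (at most 3 per row)
```

Counting the True entries shows the quadratic vs. linear growth directly.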
I don't know what types of attention the SOTA closed-source models are using, and they may well be using different techniques, but it wouldn't be surprising if there was "less attention" to tokens far back in the context. It's not obvious why this would affect a task like doing page-by-page OCR on a long PDF, though, since there it's only the most recent page that needs attending to.
I've experienced this problem, but I haven't come across papers about it. For this context, it would be interesting to compare the accuracy of transcribing one page at a time to batches of n pages.
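If anyone wants to run that experiment, here's a hedged sketch of the harness. `transcribe` stands for whatever model call you'd plug in (hypothetical here), and `ground_truth` is a known-good transcription per page:

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence

def accuracy(predicted: str, reference: str) -> float:
    # Crude similarity ratio; swap in a proper CER/WER metric if you have one.
    return SequenceMatcher(None, predicted, reference).ratio()

def compare_batch_sizes(
    pages: Sequence,                        # per-page inputs (images, text, ...)
    ground_truth: Sequence[str],            # known-good transcription per page
    transcribe: Callable[[Sequence], str],  # your model call (hypothetical)
    batch_sizes=(1, 5, 20),
):
    for n in batch_sizes:
        chunks = [transcribe(pages[i:i + n]) for i in range(0, len(pages), n)]
        score = accuracy("".join(chunks), "".join(ground_truth))
        print(f"batch size {n:>3}: similarity {score:.3f}")
```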
Necessary? No. Better? Probably. Despite larger context windows, attention failures and hallucinations aren't completely a thing of the past, even within today's expanded windows. Splitting into individual pages likely helps ensure you stay well within a normal context window size, which seems to avoid most of these issues. Asking an LLM to maintain attention over a single page is much more achievable than asking it to do so over an entire book.
Also, PDF file size isn't a relevant measure of token length: a PDF can range from a collection of high-quality JPEG images to thousands of pages of text.
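For the splitting step itself, a minimal sketch assuming pypdf is installed ("book.pdf" is a placeholder; the model call is left out):

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("book.pdf")
for i, page in enumerate(reader.pages):
    # Write each page to its own single-page PDF, ready to send
    # to the model one call at a time.
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i:04d}.pdf", "wb") as f:
        writer.write(f)
```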
They all accept large PDFs (or any kind of input), but the quality of the output will suffer for various reasons.