Comment by mchadda_chunkr
17 days ago
Hi all - CEO of chunkr.ai here.
The write-up and ensuing conversation are really exciting. I think the clear stand-out point in everything mentioned here is that document layout analysis (DLA) is the crux of building practical doc ingestion for RAG.
(Note: DLA is the process of identifying and bounding specific segments of a document - like section headers, tables, formulas, footnotes, captions, etc.)
Strap in - this is going to be a long one.
We see a lot of people and products basically sending complete pages to LVLMs to convert them to a machine-readable format and to chunk them. We tried this - it's a possible configuration on Chunkr as well - and it has never worked for our customers or during extensive internal testing across documents from a variety of verticals. Here are SOME of the common problems:
- Most documents are dense. The model won't OCR everything and will miss crucial parts.
- A bunch of hallucinated content that's tough to catch.
- Occasionally it will just refuse to give you anything. We’ve tried a bunch of different prompting techniques and the models return “<image>” or “||..|..” for an ENTIRE PAGE of content.
Despite this - it’s obvious that these ginormous neural nets are great for complex conversions like tables and formulas to HTML/Markdown & LaTeX. They also work great for describing images and converting charts to tables. But that’s the thing - they can only do this if you can pull out these document features individually as cropped images and have the model focus on small snippets of the document rather than the full page.
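To make the crop-and-focus idea concrete, here's a minimal sketch. It assumes a DLA model that returns (label, bbox) pairs in pixel coordinates; `detect_layout` and `vlm_convert` are hypothetical placeholders (not Chunkr's actual API), and Pillow handles the cropping:

```python
# Minimal sketch: crop DLA-detected segments out of a page image so a VLM
# sees a focused snippet instead of the full page. `detect_layout` and
# `vlm_convert` below are hypothetical placeholders.
from PIL import Image

def crop_segments(page_path, segments):
    """Yield (label, cropped_image) for each (label, (x0, y0, x1, y1)) pair."""
    page = Image.open(page_path)
    for label, bbox in segments:
        yield label, page.crop(bbox)

# Send only the complex crops to the big model:
# for label, crop in crop_segments("page_001.png", detect_layout("page_001.png")):
#     if label in {"table", "formula", "picture"}:
#         markup = vlm_convert(crop)  # -> HTML / Markdown / LaTeX
```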
If you want knobs for speed, quality, and cost, the best approach is to work at a segment level rather than a page level. This is where DLA really shines - the downstream processing options are vast and can be fit to specific needs. You can choose what to process with simple + fast OCR (text-only segments like headers, paragraphs, captions), and what to send to a large model like Gemini (complex segments like tables, formulas, and images) - all while getting juicy bounding boxes for mapping citations. Combine this with solid reading order algos - and you get amazing layout-aware chunking that takes ~10ms.
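As a rough illustration of that segment-level routing and chunking (not Chunkr's internals - `run_ocr` and `run_vlm` are stand-ins for real engines), the dispatch could look like:

```python
# Sketch of segment-level routing: cheap OCR for plain-text segments, a large
# model only for complex ones, keeping each segment's bbox for citations.
TEXT_SEGMENTS = {"title", "section_header", "paragraph", "caption", "footnote"}
COMPLEX_SEGMENTS = {"table", "formula", "picture"}

def process_segment(label, crop):
    if label in COMPLEX_SEGMENTS:
        return run_vlm(crop)   # slow path: HTML/Markdown/LaTeX out
    return run_ocr(crop)       # fast path: plain OCR

def chunk(segments, max_chars=2000):
    """Layout-aware chunking: segments arrive in reading order; never split
    a segment, and start a new chunk at section headers or on overflow."""
    chunks, current, size = [], [], 0
    for seg in segments:       # seg: {"label", "bbox", "text"}
        if current and (seg["label"] == "section_header"
                        or size + len(seg["text"]) > max_chars):
            chunks.append(current)
            current, size = [], 0
        current.append(seg)    # bbox travels with the chunk -> citations
        size += len(seg["text"])
    if current:
        chunks.append(current)
    return chunks
```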
We’ve built RAG apps ourselves and attempted to index all ~600 million pages of open-access research papers for https://lumina.sh. That’s why we built Chunkr - and why it needed to be Open Source. You can self-host our solution and process 4 pages per second, scaling up to ~11 million pages per month on a single RTX 4090; renting that hardware on Runpod costs just $249/month ($0.34/hour).
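A quick back-of-envelope on those numbers, assuming sustained throughput around the clock:

```python
# Back-of-envelope check of the self-hosting figures (assumes a sustained
# 4 pages/s on one RTX 4090 at the quoted Runpod rate).
pages_per_month = 4 * 3600 * 24 * 30   # 10,368,000 -> roughly 10-11M pages/month
cost_per_month = 0.34 * 24 * 30        # ~$245/month at $0.34/hour
cost_per_1k_pages = cost_per_month / (pages_per_month / 1000)  # ~$0.024
print(pages_per_month, round(cost_per_month, 2), round(cost_per_1k_pages, 4))
```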
Using a VLM to do DLA sounds awesome. We've played around with this idea but found that VLMs don't come close to models whose architecture is geared solely toward these object detection tasks. While a VLM would simplify the pipeline, VLMs are significantly slower and more resource-hungry - they can't match the speed we achieve on consumer hardware with dedicated models. Nevertheless, the numerous advances in the field are very exciting - big if true!
A note on costs:
There are some discrepancies in the API pricing quoted for providers in this thread. Assuming 100,000 pages and feature parity (worked through in the snippet below):
- Chunkr API: 200 pages per $1, not 100 pages
- AWS Textract: 40 pages per $1, not 1,000 pages (no VLMs)
- LlamaParse: 13 pages per $1, not 300
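At those rates, the same 100,000-page workload prices out to:

```python
# What the quoted per-dollar rates imply for the 100,000-page scenario above.
rates = {"Chunkr API": 200, "AWS Textract": 40, "LlamaParse": 13}  # pages per $1
for name, pages_per_dollar in rates.items():
    print(f"{name}: ${100_000 / pages_per_dollar:,.0f} per 100k pages")
# Chunkr API: $500 | AWS Textract: $2,500 | LlamaParse: $7,692
```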
A note on RD-Bench:
We’ve been using Gemini 1.5 Pro for tables and other complex segments for a while, so the RD-Bench result is very outdated. We ran it again on a few hundred samples and got 0.81 (the repo below also includes some notes on the bench itself). To the OP: it would be awesome if you could update your blog post!
https://github.com/lumina-ai-inc/chunkr-table-rdbench/tree/m...