Comment by llm_trw
17 days ago
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.
You feed each image box into a multimodal model to describe what the image is about.
For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
You then stitch everything together in an XML file because Markdown is for human consumption.
You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
You then get chunking with location data and confidence scores for every part of the document to put as metadata into the RAG store.
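A minimal sketch of the shape of that pipeline, with an ultralytics-style YOLO detector and Tesseract standing in for the actual (custom, proprietary) layout and OCR models; the weights file, label names and XML attribute names are all made up for illustration:

    import xml.etree.ElementTree as ET
    import pytesseract
    from PIL import Image
    from ultralytics import YOLO

    layout_model = YOLO("doc_layout_yolov8.pt")   # hypothetical document-layout weights
    page = Image.open("page_001.png")

    doc = ET.Element("page", src="page_001.png")
    for box in layout_model(page)[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        label = layout_model.names[int(box.cls)]  # e.g. "text", "table", "figure"
        el = ET.SubElement(doc, label,
                           bbox=f"{x1},{y1},{x2},{y2}",
                           det_conf=f"{float(box.conf):.3f}")
        if label == "text":
            crop = page.crop((x1, y1, x2, y2))
            # Word-level text plus per-word confidences from the OCR model.
            data = pytesseract.image_to_data(crop, output_type=pytesseract.Output.DICT)
            words = [w for w in data["text"] if w.strip()]
            confs = [float(c) for w, c in zip(data["text"], data["conf"]) if w.strip()]
            el.text = " ".join(words)
            el.set("ocr_conf", f"{sum(confs) / max(len(confs), 1):.1f}")
        # "table" boxes would go to a table-structure model and "figure" boxes
        # to a multimodal captioner; both omitted here.

    ET.ElementTree(doc).write("page_001.xml")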
I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
Not sure what service you're basing your calculation on, but with Gemini I've processed 10,000,000+ shipping documents (PDFs and PNGs) of every conceivable layout in one month, at under $1000 and an accuracy rate of 80-82% (humans were at 66%).
The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Just to get sick with it, we actually added some recursion to the Gemini step to have it rate how well it extracted and, if it was below a certain rating, rewrite its own instructions on how to extract the information and then feed that back into itself. We didn't see any improvement in accuracy, but it was still fun to do.
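For reference, the Gemini step in a pipeline like that is roughly this small; a sketch using the google-generativeai Python SDK, where the model name, prompt and field names are illustrative rather than the actual production setup:

    import json
    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-flash")

    pdf = genai.upload_file("shipping_doc_0001.pdf")   # pulled from the storage bucket
    resp = model.generate_content(
        [pdf, "Extract the shipper, consignee, ship date and line items from this "
              "shipping document. Respond with JSON only."],
        generation_config={"response_mime_type": "application/json"},
    )
    record = json.loads(resp.text)                     # -> written to the database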
>Not sure what service you're basing your calculation on, but with Gemini
The table of costs in the blog post. At 500,000 pages per day the hardware fixed cost overcomes the software variable cost at day 240, and from then on you're paying an extra ~$100 per day to keep it running in the cloud. The machine also had to use extremely beefy GPUs to fit all the models it needed. Compute utilization was between 5 and 10%, which means it's future-proof for the next 5 years at the rate at which the data source was growing.
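Back-of-the-envelope on those figures, using only the numbers quoted above:

    hardware_cost = 20_000                     # one-off cost of the machine, USD
    breakeven_day = 240                        # day the cumulative cloud bill catches up
    implied_cloud_cost_per_day = hardware_cost / breakeven_day
    print(round(implied_cloud_cost_per_day))   # ~83 USD/day implied, i.e. the "~$100 per day"
                                               # of ongoing cloud cost after break-even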
There is also the fact that it's _completely_ local, which meant we could throw every proprietary data source that couldn't leave the company at it.
>The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Each company should build tools that match the skill level of its developers. If you're not comfortable training models locally, with all that entails, off-the-shelf solutions let companies punch way above their weight class in their industry.
That assumes that you're able to find a model that can match Gemini's performance - I haven't come across anything that comes close (although hopefully that changes).
Very cool! How are you storing it to a database - vectors? What do you do with the extracted data (in terms of being able to pull it up via some query system)?
In this use case the customer just wanted to capture data not currently in the warehouse inventory management system, so here we converted the JSON response to a classic table-row schema (where 1 row = 1 document) and, boom, shipping data!
However, we do very much recommend storing the raw model responses for audit, and then at least as vector embeddings, in anticipation that the data will need to be used for vector search and RAG. Kind of like "while we're here, why don't we do what you're going to want to do at some point, even if it's not your use case now..."
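A sketch of that JSON-to-row step, with SQLite standing in for the real warehouse and made-up column names; the raw model response is kept alongside for audit:

    import json
    import sqlite3

    conn = sqlite3.connect("shipping.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS shipments (
                        doc_id TEXT PRIMARY KEY,
                        shipper TEXT,
                        consignee TEXT,
                        ship_date TEXT,
                        raw_response TEXT)""")

    def store_document(doc_id: str, gemini_json: str) -> None:
        data = json.loads(gemini_json)
        # One document becomes one row; the raw response is retained for audit
        # (and can later be embedded for vector search / RAG).
        conn.execute(
            "INSERT OR REPLACE INTO shipments VALUES (?, ?, ?, ?, ?)",
            (doc_id, data.get("shipper"), data.get("consignee"),
             data.get("ship_date"), gemini_json),
        )
        conn.commit()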
> [with] an accuracy rate of between 80-82% (humans were at 66%)
Was this human-verified in some way? If not, how did you establish the facts-on-the-ground about accuracy?
Yup, unfortunately the only way to know how good an AI is at anything is to do it the same way you would with a human: build a test that you know the answers to already. That's also why the accuracy evaluation was by far the most time-intensive part of the development pipeline, as we had to manually build a "ground-truth" dataset that we could evaluate the AI against.
I feel compelled to reply. You've made a bunch of assumptions, and presented your success (likely with a limited set of table formats) as the one true way to parse PDFs. There's no such thing.
In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted sub sections, tables inside cells etc. Claude (and now Gemini) can parse complex tables and convert that to meaningful data. Your solution will likely fail, because rules are fuzzy in the same way written language is fuzzy.
Recently someone posted this on HN, it's a good read: https://lukaspetersson.com/blog/2025/bitter-vertical/
> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
No, not like that, but often as nested JSON or XML. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.
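As one illustration of the kind of error checking meant here (not necessarily the actual method): a cheap cross-field consistency check on the extracted JSON, with hypothetical field names:

    def looks_suspicious(doc: dict, tolerance: float = 0.01) -> bool:
        """Flag an extracted financial document for review when its numbers
        don't reconcile -- one simple check among many possible."""
        stated_total = doc.get("total")
        if stated_total is None:
            return True                                   # required field missing
        line_total = sum(item["amount"] for item in doc.get("line_items", []))
        return abs(line_total - stated_total) > tolerance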
> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
One should refrain from making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1,000, one might prefer Gemini or another service over spending engineering time. There are many apps where processing a single doc brings in, say, $10 in revenue. You don't care about OCR costs.
> I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.
Related discussion:
AI founders will learn the bitter lesson
https://news.ycombinator.com/item?id=42672790 - 25 days ago, 263 comments
The HN discussion contains a lot of interesting ideas, thanks for the pointer!
You're making an even less charitable set of assumptions:
1). I'm incompetent enough to ignore publicly available table benchmarks.
2). I'm incompetent enough to never look at poor quality data.
3). I'm incompetent enough to not create a validation dataset for all models that were available.
Needless to say you're wrong on all three.
My rate is $400 + taxes per hour if you want to be run through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.
Whoa, this is a really aggressive response. No one is calling you incompetent; they're criticizing your assumptions.
> My rate is $400 + taxes per hour if you want to be run through each point
Great, thanks for sharing.
bragging about billing $400 an hour LOL
Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields but can all render to HTML. It can write out to JSON, HTML, or Markdown.
I integrated Gemini recently to improve accuracy on certain blocks like tables (get the initial text, then pass it to Gemini to refine). Marker alone works about as well as Gemini alone, but together they benchmark much better.
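The refine step is conceptually something like the following; a sketch with the google-generativeai SDK, not Marker's internal API, and the prompt is only illustrative:

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-flash")

    def refine_table(block_image: Image.Image, initial_html: str) -> str:
        # Give Gemini both the cropped table image and the first-pass extraction,
        # and ask it only to correct mistakes rather than re-extract from scratch.
        resp = model.generate_content([
            block_image,
            "Here is an imperfect HTML rendering of the table in this image:\n"
            f"{initial_html}\n"
            "Return corrected HTML, changing only cells that are wrong.",
        ])
        return resp.text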
I used SXML [0] unironically and extensively in this project.
The rendering step for reports that humans got to see was a call to pandoc after the SXML was rendered to Markdown - look ma, we support PowerPoint! - but it also allowed us to easily convert, on the fly, to whatever insane markup a given large (or small) language model worked best with.
[0] https://en.wikipedia.org/wiki/SXML
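The human-facing half of that rendering step amounts to a single pandoc call once the SXML has been turned into Markdown (file names illustrative):

    import subprocess

    # Markdown in, PowerPoint out; pandoc also covers docx, html, pdf, etc.
    subprocess.run(["pandoc", "report.md", "-o", "report.pptx"], check=True)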
Why process separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?
It's funny you astroturf your own project in a thread where another is presenting tangential info about their own
what does marker add on top of docling?
Docling is a great project, happy to see more people building in the space.
Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:
This is a great comment. I will mention another benefit to this approach: the same pipeline works for PDFs that are digital-native and don't require OCR. After the object detection step, you collect the text directly from within the bounding boxes, and the text is error-free. Using Gemini means that you give this up.
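For the digital-native case that step is just clipping text out of the PDF by bounding box; a sketch with PyMuPDF, with illustrative coordinates (if the layout model ran on a page rendered at N dpi, scale its pixel boxes by 72/N to get page points):

    import fitz  # PyMuPDF

    doc = fitz.open("report.pdf")
    page = doc[0]

    # Bounding box (in page points) produced by the layout detection step.
    region = fitz.Rect(72, 144, 540, 400)
    text = page.get_text("text", clip=region)   # exact embedded text, no OCR errors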
You're describing yesterday's world. With the advancement of AI, there is no need for any of these many steps and stages of OCR anymore. There is no need for XML in your pipeline, because Markdown is now equally suited for machine consumption by AI models.
The results we got 18 months ago are still better than the current Gemini benchmarks, at a fraction of the cost.
As for Markdown: great. Now how do you encode the metadata about the model's confidence that the text says what it thinks it says? Because XML has this lovely thing called attributes, which lets you keep a provenance record that's also readable by the LLM, without a second database.
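Concretely, that means each element carries its own provenance inline and you can filter on it before anything reaches the LLM; a sketch with made-up attribute names and threshold:

    import xml.etree.ElementTree as ET

    page = ET.parse("page_001.xml").getroot()

    # Only forward text the OCR model was confident about; route the rest
    # (identified by bounding box) to human review instead.
    confident = [el.text for el in page.findall("text")
                 if float(el.get("ocr_conf", "0")) >= 85.0]
    needs_review = [el.get("bbox") for el in page.findall("text")
                    if float(el.get("ocr_conf", "0")) < 85.0]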
Just commenting here so that I can find my way back to this comment later. You perfectly captured the AI hype in one small paragraph.
Hey, why settle for yesteryear's world, with better accuracy, lower costs and local deployment, when you can use today's new shiny tool, reinvent the wheel, put everything in the cloud, and get hallucinations for free...
Just commenting here to say the GP is spot on.
If you already have a highly optimized pipeline built yesterday, then sure, keep using it.
But if you're starting to deal with PDFs today, just use Gemini. Use the most human-readable formats you can find, because we know AI will be optimized for understanding them. Don't even think about "stitching XML files", blah blah.
For future reference: if you click on the timestamp of a comment, that will bring you to a screen that has a “favorite” link. Click that to add the comment to your favorites list, which you can find on your profile page.
> I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
That is impressive. However, if someone needs to read a couple of hundred pages per day, there's no point in setting all that up.
Also, you neglected to mention the cost of setting everything up. The machine cost $20k, but your time, and the cost to train YOLOv8, probably cost more than that. If you want to compare costs (find the point where a local implementation such as this is the better ROI), you should compare fully loaded costs.
Or, depending on your use case, you do it in one step and ask an LLM to extract data from a PDF.
What you describe is obviously better and more robust by a lot, but the LLM-only approach is not "wrong". It's simple, fast, easy to set up and understand, and it works. With less accuracy, but it does work. Depending on the constraints, development budget and load, it's a perfectly acceptable solution.
We did this to handle 2,000 documents per month and are satisfied with the results. If we need to upgrade to something better in the future we will, but in the meantime, it's done.
Fwiw, I'm not convinced Gemini isn't using a document-based object detection model for this, at least for some parts of it or for some doc categories (especially common things like IDs, bills, tax forms, invoices & POs, shipping documents, etc. that they've previously created document extractors for as part of their DocAI cloud service).
I don't see why they would do that. The whole point of training a model like Gemini is that you train the model; if they want it to work great against those different categories of document, the likely way to do it is to add a whole bunch of those documents to Gemini's regular training set.
Getting "bitter lesson" vibes from this post
The bitter lesson is very little of the sort.
If we had unlimited memory, compute and data, we'd use a rank-N tensor for an input of length N and call it a day.
Unfortunately N^N grows rather fast and we have to do all sorts of interesting engineering to make ML calculations complete before the heat death of the universe.
> Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.
I think you are being pedantic here; business decisions aren't made purely on cost, but on brittleness, maintenance, and time to market.
You are assuming you can match Gemini's performance and Google's engineering resources, and that costs will stay constant into the future.
The only thing I could find about GridFormer and tables was this: https://arxiv.org/pdf/2309.14962v1
But there is no GitHub link or details on the implementation. The only model available seems to be one for removing weather effects from images: https://github.com/TaoWangzj/GridFormer
Would you care to expand on how you would use GridFormer for extracting tables from images? It seems like it's not as trivial as using something like Excalibur or Tabula, both of which seem more battle-tested.
That sounds like a sound approach. Are the steps easily upgradable with better models? Also, it sounds like you can use a character recognition model on single characters? Do you do extra checks for numerical characters?
This is exactly the wrong mentality to have about new technology.
Impressive. Can you share anything more about this project? 500k pages a day is massive and I can imagine why one would require that much throughput.
It was a financial company that needed a tool that would outperform the Bloomberg terminal for traders and quants in markets where its coverage is spotty.
You mentioned GridFormer; I found a paper describing it (GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction). How did you implement it?
Do you know of a model other than GridFormer for detecting tables that has an available implementation somewhere?
We had to roll our own from research papers unfortunately.
The number one takeaway we got was to use much larger images than anything anyone else ever mentioned to get good results. A rule of thumb was that if you print the PNG of the image, it should be easily readable from 2m away.
The actual model is proprietary and stuck in corporate land forever.
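That rule of thumb mostly comes down to rendering pages at a far higher DPI than the usual defaults before they reach the table model; a sketch with PyMuPDF, with 400 dpi picked arbitrarily as "much larger than usual":

    import fitz  # PyMuPDF

    doc = fitz.open("scans.pdf")
    zoom = 400 / 72                       # render at ~400 dpi instead of the 72 dpi default
    matrix = fitz.Matrix(zoom, zoom)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(matrix=matrix)
        pix.save(f"page_{i:04d}.png")     # big PNGs: print one and it reads from 2m away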
I honestly can't tell if you are being serious. Is there any doubt that the "OCR pipeline" will just be an LLM, and that it's just a matter of time?
What you are describing is similar to how computers used to detect cats. You first extract edges, textures and gradients. Then you use a sliding window and run a classifier. Then you use NMS to merge the bounding boxes.
What object detection model do you use?
Is Tesseract even ML-based? Oh, this piece of software is more than 19 years old; perhaps there are other ways to do good, cheap OCR now. Does Gemini have an OCR library internally? For other LLMs, I had the feeling that the LLM scripts a few lines of Python to do the actual heavy lifting with a common OCR framework.
A custom-trained YOLOv8. The work was done in 2023 and I've moved on since then; you'd get better results for much less today.
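For anyone curious, fine-tuning YOLOv8 on a document-layout dataset is only a few lines with the ultralytics package; the dataset yaml, image size and epoch count below are placeholders:

    from ultralytics import YOLO

    # Start from COCO-pretrained weights and fine-tune on your own layout annotations
    # (classes such as text block, table, figure defined in the dataset yaml).
    model = YOLO("yolov8m.pt")
    model.train(data="doc_layout.yaml", imgsz=1280, epochs=100)

    results = model("page_001.png")       # boxes, classes and confidences per region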