I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate how much ease of use a multi-modal, large-context-window model buys you. Ironically, this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF, yet many of our requests failed over to their human-in-the-loop process. Despite this not being Gemini's specialization, switching to it was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor and price was significantly cheaper. For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair". We probably could improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.
The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem thanks to the weirdly large context window. Multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
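For anyone curious what that looks like in practice, here is a minimal sketch assuming the google-genai Python SDK; the model name, schema fields, and prompt are illustrative, not the parent poster's actual setup:

```python
# Minimal sketch: send a PDF "part" plus a JSON schema and get structured output back.
# Assumes the google-genai Python SDK; the schema below is invented for illustration.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("statement.pdf", "rb") as f:
    pdf_part = types.Part.from_bytes(data=f.read(), mime_type="application/pdf")

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[pdf_part, "OCR this PDF into the format specified by this JSON schema."],
    config=types.GenerateContentConfig(
        temperature=0,                          # keep OCR-style output as deterministic as possible
        response_mime_type="application/json",
        response_schema=schema,
    ),
)
print(response.text)  # JSON string matching the schema
```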
This is spot on; any legacy vendor focusing on a specific type of PDF is going to get obliterated by LLMs. The problem with using an off-the-shelf provider like this is that you get stuck with their data schema. With an LLM, you have full control over the schema, meaning you can parse and extract much more unique data.
The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"
You could improve your accuracy further by adding some chain-of-thought to your prompt btw. e.g. Make each field in your json schema have a `reasoning` field beforehand so the model can CoT how it got to its answer. If you want to take it to the next level, `citations` in our experience also improves performance (and when combined with bounding boxes, is powerful for human-in-the-loop tooling).
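As a rough illustration of that suggestion (field names invented, not from any real schema), the idea is to put the reasoning and citation keys ahead of the answer key so the model writes its justification before committing to a value:

```python
# Sketch of a per-field reasoning/citation layout; keys and descriptions are
# made up for illustration. The intent is that the reasoning and citation text
# are generated before the value they justify.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name_reasoning": {
            "type": "string",
            "description": "Quote the text on the page that identifies the vendor and explain the choice.",
        },
        "vendor_name": {"type": "string"},
        "total_amount_reasoning": {
            "type": "string",
            "description": "Explain which line(s) the total was read from, including any arithmetic check.",
        },
        "total_amount_citation": {
            "type": "string",
            "description": "Verbatim snippet from the document supporting the value.",
        },
        "total_amount": {"type": "number"},
    },
}
```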
Disclaimer: I started an LLM doc processing infra company (https://extend.app/)
> The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"
A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.
I have some out-of-print books that I want to convert into nice pdf's/epubs (like, reference-quality)
1) I don't mind destroying the binding to get the best quality. Any idea how I do so?
2) I have a multipage double-sided scanner (fujitsu scansnap). would this be sufficient to do the scan portion?
3) Is there anything that determines the font of the book text and reproduces that somehow? and that deals with things like bold and italic and applies that either as markdown output or what have you?
4) how do you de-paginate the raw text to reflow into (say) an epub or pdf format that will paginate based on the output device (page size/layout) specification?
As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in Gemini for half a year, and every single week the results have been slightly different.
Of note, the docs are bilingual, which could affect the results, but what struck me is the lack of consistency: even with the same model, running it two or three times in a row gives different results.
That's fine for my usage, but it sounds like a nightmare if, every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.
And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
Wait, isn't there at least a two-step process here? One is semantic segmentation, followed by a method like Textract for the text, to avoid hallucinations?
One cannot possibly say that "Text extracted by a multimodal model cannot hallucinate"?
> accuracy was like 96% of that of the vendor and price was significantly cheaper.
I would like to know how this 96% was tested. If you use a human to do random-sample-based testing, how do you adjust the random sample for variations in the distribution of errors? A small set of documents could contain 90% of the errors and yet be only 1% of the docs.
One thing people always forget about traditional OCR providers (azure, tesseract, aws textract, etc.) is that they're ~85% accurate.
They are all probabilistic. You literally get back characters + confidence intervals. So when textract gives you back incorrect characters, is that a hallucination?
For an OCR company I imagine it is unconscionable to do this, because if you OCR, say, an oral history project for a library and you make hallucination errors, well, you've replaced facts with fiction. Rewriting history? What the actual F.
Wouldn't the temperature on something like OCR be very low? You want the same result every time. Isn't some part of hallucination the randomness of temperature?
The LLMs are near perfect (maybe parsing I instead of 1) - if you're using the outputs in the context of RAG, your errors are likely much, much higher in the other parts of your system. Spending a ton of time and money chasing 9's when 99% of your system's errors have totally different root causes seems like a bad use of time (unless they're not).
This sounds extremely like my old tax accounting job. OCR existed and "worked" but it was faster to just enter the numbers manually than fix all the errors.
Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.
If Gemini can do semantic chunking at the same time as extraction, all for so cheap and with nearly perfect accuracy, and without brittle prompting incantation magic, this is huge.
If I used Gemini 2.0 for extraction and chunking to feed into a RAG that I maintain on my local network, then what sort of locally-hosted LLM would I need to gain meaningful insights from my knowledge base? Would a 13B parameter model be sufficient?
This is great, I just want to highlight how nuts it is that we have spun up whole industries around extracting text that was typically printed from a computer, back into a computer.
There should be laws that mandate that financial information be provided in a sensible format: even Office Open XML would be better than this insanity. Then we can redirect all this wasted effort into digging ditches and filling them back in again.
I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
For this specific use case you can also try edgartools[1] which is a library that was relatively recently released that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.
How do today's LLMs like Gemini compare with the Document Understanding services Google/AWS/Azure have offered for a few years, particularly when dealing with known forms? I think Google's is Document AI.
I've found the highest-accuracy solution is to OCR with one of the dedicated models, then feed that text and the original image into an LLM with a prompt asking it to correct the OCR output against the image.
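A rough sketch of that two-pass idea, with pytesseract standing in for the dedicated OCR model and Gemini (via the google-genai SDK) as the correcting LLM; the prompt wording here is illustrative, not the parent's:

```python
# Two-pass sketch: dedicated OCR first, then an LLM corrects the OCR text
# against the original image. pytesseract and the prompt are stand-ins.
import pytesseract
from PIL import Image
from google import genai
from google.genai import types

ocr_text = pytesseract.image_to_string(Image.open("page.png"))

client = genai.Client(api_key="YOUR_API_KEY")
with open("page.png", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")

prompt = (
    "Here is OCR output for the attached page:\n\n"
    f"{ocr_text}\n\n"
    "Correct any OCR errors using the image as ground truth. "
    "Return only the corrected text, preserving the original layout."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[image_part, prompt],
    config=types.GenerateContentConfig(temperature=0),
)
print(response.text)
```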
member of the gemini team here -- personally, i'd recommend directly using gemini vs the document understanding services for OCR & general docs understanding tasks. From our internal evals gemini is now stronger than these solutions and is only going to get much better (higher precision, lower hallucination rates) from here.
GCP's Document AI service is now literally just a UI layer specific to document-parsing use cases, backed by Gemini models. When we realized that, we dumped it and just use Gemini directly.
They will, and they'll still have a solid product to sell, because their value proposition isn't accurate OCR per se, but putting an SLA on it.
Reaching reliability with LLM OCR might involve some combination of multiple LLMs (and keeping track of how they change), perhaps mixed with old-school algorithms, and random sample reviews by humans. They can tune this pipeline however they need at their leisure to eke out extra accuracy, and then put written guarantees on top, and still be cheaper for you long-term.
With “Next generation, extremely sophisticated AI”, to be precise, I would say. ;)
Marketing joke aside, maybe a hybrid approach could serve the vendor well. Best of both worlds if it reaps benefits, or they could even have a look at Hugging Face for even more specialized (aka better) LLMs.
I'm curious to hear about your experience with this. Which solution were you using before (the one that took 12 minutes)? If it was a self-hosted solution, what hardware were you using? How does Gemini handle PDFs with an unknown schema, and how does it compare to other general PDF parsing tools like Amazon Textract or Azure Document Intelligence? In my initial test, tables and checkboxes weren't well recognized.
> For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair".
I'm actually somewhat surprised Gemini didn't guess from context that LLC is much more likely?
I guess the OCR subsystem is intentionally conservative? (Though I'm sure you could do a second step on your end: take the output from the conservative OCR pass, send it through Gemini, and ask it to flag potential OCR problems. I bet that would flag most of them with very few false positives and false negatives.)
but were often written with typewriters long ago to get nice structured tabular output. Deals with text being split across lines and across pages just fine.
It is cheaper now, but I wonder if it will continue to be cheaper when companies like Google and OpenAI decide they want to make a profit off of AI, instead of pouring billions of dollars of investment funds into it. By the time that happens, many of the specialized service providers will be out of business and Google will be free to jack up the price.
I use Claude through OpenRouter (with Aider), and was pretty amazed to see that it routes the requests during the same session almost round-robin through Amazon Bedrock, sometimes through Google Vertex, sometimes through Anthropic themselves, all of course using the same underlying model.
Literally whoever has the cheapest compute.
With the speed that AI models are improving these days, it seems like the 'moat' of a better model is only a few months before it is commoditized and goes to the cheapest provider.
I've been wanting to build a system that ingests PDF reports that reference other types of data (images, CSVs, etc.) which can also be ingested, to ultimately build an analytics database from the stack of unsorted data and its metadata, but I have not found any time to do anything like that yet. What kind of tooling do you use to build your data pipelines?
It's great to hear it's this good, and it makes sense since Google has had several years of experience creating document-type-specific OCR extractors as components of their Document AI product in Cloud. What's most heartening is to hear that the legwork they did for that set of solutions has made it into Gemini for consumers (and businesses).
Successful document processing vendors use LLMs already. I know this at least of Klippa. They have (apparently) fine-tuned models, prompts, etc. The biggest issue with using LLMs directly is error handling, validation and "parameter drift"/randomness. This is the typical "I'll build it myself, but worse" thing.
I'm interested to hear what your experience has been dealing with optional data. For example if the input pdf has fields which are sometimes not populated or nonexistent, is Gemini smart enough to leave those fields blank in the output schema? Usually the LLM tries to please you and makes up values here.
You could ingest them with AWS Textract and have predictability and formatting in the format of your choice. Using LLMs for this is lazy and generates unpredictable and non-deterministic results.
Did you try other vision models such as ChatGPT and Grok? I'm doing something similar but struggled to find good comparisons between the vision models in terms of OCR and document understanding.
If the documents have the same format, maybe you could include an example document in the prompt, so the boilerplate stuff (like LLC) gets handled properly.
> Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
You then feed each box of text to a regular OCR model, also gives you a confidence score along with each prediction it makes.
You feed each image box into a multimodal model to describe what the image is about.
For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
You then stitch everything together in an XML file because Markdown is for human consumption.
You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
You then get chunking with location data and confidence scores of every part of the document to put as meta data into the RAG store.
I've built a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.
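For readers trying to picture the shape of such a pipeline, here is a heavily simplified sketch with off-the-shelf stand-ins (an ultralytics YOLO layout model and Tesseract) in place of the poster's custom-trained models; the weights file, class names, and attribute names are placeholders:

```python
# Heavily simplified sketch of the detect -> OCR -> XML flow described above,
# with generic stand-ins: an ultralytics YOLO model for layout detection and
# Tesseract for per-region OCR. "doc-layout.pt" and the class names are
# placeholders, not a real published model.
import xml.etree.ElementTree as ET
import pytesseract
from PIL import Image
from ultralytics import YOLO

layout_model = YOLO("doc-layout.pt")   # placeholder: a detector trained on document sections
page = Image.open("page_001.png")

doc = ET.Element("page", source="page_001.png")
for det in layout_model(page)[0].boxes:
    x1, y1, x2, y2 = (int(v) for v in det.xyxy[0].tolist())
    label = layout_model.names[int(det.cls)]        # e.g. "paragraph", "table", "figure"
    region = ET.SubElement(doc, label,
                           bbox=f"{x1},{y1},{x2},{y2}",
                           det_conf=f"{float(det.conf):.3f}")

    if label == "figure":
        continue  # figures would go to a captioning model; tables to a table-structure model

    data = pytesseract.image_to_data(page.crop((x1, y1, x2, y2)),
                                     output_type=pytesseract.Output.DICT)
    words = [w for w in data["text"] if w.strip()]
    confs = [float(c) for w, c in zip(data["text"], data["conf"]) if w.strip()]
    region.text = " ".join(words)
    region.set("ocr_conf", f"{sum(confs) / max(len(confs), 1):.1f}")

print(ET.tostring(doc, encoding="unicode"))
```

The confidence values ride along as XML attributes, which is the provenance point raised further down this thread.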
Not sure what service you're basing your calculation on, but with Gemini I've processed 10,000,000+ shipping documents (PDFs and PNGs) of every conceivable layout in one month at under $1000, with an accuracy rate of between 80-82% (humans were at 66%).
The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Just to get sick with it we actually added some recursion to the Gemini step to have it rate how well it extracted, and if it was below a certain rating, to rewrite its own instructions on how to extract the information and then feed that back into itself. We didn't see any improvement in accuracy, but it was still fun to do.
>Not sure what service you're basing your calculation on but with Gemini
The table of costs in the blog post. At 500,000 pages per day, the hardware fixed cost overcomes the software variable cost at day 240, and from then on you're paying an extra ~$100 per day to keep it running in the cloud. The machine also had to use extremely beefy GPUs to fit all the models it needed. Compute utilization was between 5 and 10%, which means it's future-proof for the next 5 years at the rate at which the data source was growing.
There is also the fact that it's _completely_ local, which meant we could throw every proprietary data source that couldn't leave the company at it.
>The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Each company should build tools which match the skill level of their developers. If you're not comfortable training models locally, with all that entails, off-the-shelf solutions allow companies to punch way above their weight class in their industry.
Very cool! How are you storing it to a database - vectors? What do you do with the extracted data (in terms of being able to pull it up via some query system)?
I feel compelled to reply. You've made a bunch of assumptions, and presented your success (likely with a limited set of table formats) as the one true way to parse PDFs. There's no such thing.
In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted sub sections, tables inside cells etc. Claude (and now Gemini) can parse complex tables and convert that to meaningful data. Your solution will likely fail, because rules are fuzzy in the same way written language is fuzzy.
> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
No, not like that, but often as nested JSON or XML. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.
> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
One should refrain from making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1000, one might prefer Gemini/another model over spending engineering time. There are many apps where processing a single doc brings, say, $10 in revenue. You don't care about OCR costs.
> I've built a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.
The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.
You're making an even less charitable set of assumptions:
1). I'm incompetent enough to ignore publicly available table benchmarks.
2). I'm incompetent enough to never look at poor quality data.
3). I'm incompetent enough to not create a validation dataset for all models that were available.
Needless to say you're wrong on all three.
My day rate is $400 + taxes per hour if you want to be run through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.
Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields but can all render to HTML. It can write out to JSON, HTML, or Markdown.
I integrated Gemini recently to improve accuracy in certain blocks like tables (get the initial text, then pass it to Gemini to refine). Marker alone works about as well as Gemini alone, but together they benchmark much better.
I used sxml [0] unironically in this project extensively.
The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.
Why process separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?
This is a great comment. I will mention another benefit to this approach: the same pipeline works for PDFs that are digital-native and don't require OCR. After the object detection step, you collect the text directly from within the bounding boxes, and the text is error-free. Using Gemini means that you give this up.
You're describing yesterday's world. With the advancement of AI, there is no need for any of these many steps and stages of OCR anymore. There is no need for XML in your pipeline, because Markdown is now equally suited for machine consumption by AI models.
The results we got 18 months ago are still better than the current gemini benchmarks at a fraction the cost.
As for Markdown, great. Now how do you encode the metadata about the model's confidence that the text says what it thinks it says? Because XML has this lovely thing called attributes that lets you keep a provenance record, without a second database, that's also readable by the LLM.
> I've built a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.
That is impressive. However, if someone needs to read a couple of hundred pages per day, there's no point in setting all that up.
Also, you neglected to mention the cost of setting everything up. The machine cost $20k, but your time, and the cost to train YOLOv8, probably cost more than that. If you want to compare costs (find the point where a local implementation such as this is the better ROI), you should compare fully loaded costs.
Or, depending on your use case, you do it in one step and ask an LLM to extract data from a PDF.
What you describe is obviously better and more robust by a lot, but the LLM-only approach is not "wrong". It's simple, fast, easy to set up and understand, and it works. With less accuracy, but it does work. Depending on the constraints, development budget and load, it's a perfectly acceptable solution.
We did this to handle 2000 documents per month and are satisfied with the results. If we need to upgrade to something better in the future we will, but in the mean time, it’s done.
Fwiw, I'm not convinced Gemini isn't using a document-based object detection model for this, at least for some parts of this or for some doc categories (especially common things like IDs, bills, tax forms, invoices & POs, shipping documents, etc. that they've previously created document extractors for as part of their DocAI cloud service).
I don't see why they would do that. The whole point of training a model like Gemini is that you train the model - if they want it to work great against those different categories of document the likely way to do it is to add a whole bunch of those documents to Gemini's regular training set.
If we had unlimited memory, compute and data we'd use a rank N tensor for an input of length N and call it a day.
Unfortunately N^N grows rather fast and we have to do all sorts of interesting engineering to make ML calculations complete before the heat death of the universe.
But there is no GitHub link or details on the implementation. The only model available seems to be one for removing weather effects from images: https://github.com/TaoWangzj/GridFormer
Would you care to expand on how you would use GridFormer for extracting tables from images? It seems like it's not as trivial as using something like Excalibur or Tabula, both of which seem more battle-tested.
That sounds like a sound approach. Are the steps easily upgradable with better models? Also, it sounds like you can use a character recognition model on single characters? Do you do extra checks for numerical characters?
It was a financial company that needed a tool that would out perform Bloomberg terminal for traders and quants in markets where their coverage is spotty.
You mentioned GridFormer; I found a paper describing it (GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction). How did you implement it?
We had to roll our own from research papers unfortunately.
The number one take away we got was to use much larger images than anything that anyone else ever mentioned to get good results. A rule of thumb was that if you print the png of the image it should be easily readable from 2m away.
The actual model is proprietary and stuck in corporate land forever.
I honestly can't tell if you are being serious. Is there any doubt that the "OCR pipeline" will just be an LLM and it's just a matter of time?
What you are describing is similar to how computers used to detect cats. You first extract edges, texture and gradient. Then use a sliding window and run a classifier. Then you use NMS to merge the bounding boxes.
Is tesseract even ML based? Oh, this piece of software is more than 19 years old, perhaps there are other ways to do good, cheap OCR now.
Does Gemini have an OCR library, internally?
For other LLMs, I had the feeling that the LLM scripts a few lines of python to do the actual heavy lifting with a common OCR framework.
Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.
PDF provides that capability, but editors don't produce it, probably because printing goes through OS drivers that don't support it, or PDF generators that don't support it. Or they do support it but users don't know to check that option, or turn it off because it makes PDFs too large.
It's not the structure that allows meaningful understanding.
Something that was clearly a table now becomes a bunch of glyphs physically close to each other vs. a group of other glyphs, which, when considered as a group, is a box visually separated from another group of glyphs but actually part of a table.
We are driving full speed into a Xerox 2.0 moment, and this time we are doing so knowingly. At least with Xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I wonder if we will adapt our systems and procedures to account for hallucinations and "85%" accuracy.
And no, outlawing the use of AI or increasing liability for its use will do next to nothing to deter its misuse, and everyone knows it. My heart goes out to the remaining 15%.
I love generative AI as a technology. But the worst thing about its arrival has been the reckless abandonment of all engineering discipline and common sense. It’s embarrassing.
The first thing that guy says is that existing non-AI solutions are not that great. Then he says that AI beats them on accuracy. So I don't quite understand the point you're trying to make here.
Humans accept a degree of error for convenience. (driving is one of them). But no, 15% is not the acceptable rate. More like 0.15% to 0.015% depending on the country.
(disclaimer I am CEO of llamaindex, which includes LlamaParse)
Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.
Some quick notes:
1. I'm glad that LlamaParse is mentioned in the article, but it's not mentioned in the performance benchmarks. I'm pretty confident that our most accurate modes are at the top of the table benchmark - our stuff is pretty good.
2. There's a long tail of issues beyond just tables - this includes fonts, headers/footers, ability to recognize charts/images/form fields, and as other posters said, the ability to have fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these modes, and we need proper benchmarks for that.
3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits + can retry on failure.
The very first (and probably hand-picked & checked) example on your website [0] suffers from the very problem people are talking about here: in the "Fiscal 2024" row it contains an error in the CEO CAP column. The image says "$234.1" but the parsed result says "$234.4". A small error, but an error nonetheless. I wonder if we can ever fix these kinds of errors with LLM parsing.
I'm a happy customer. I wrote a ruby client for your API and have been parsing thousands of different types of PDFs through it with great results. I tested almost everything out there at the time and I couldn't find anything that came close to being as good as llamaparse.
Indeed, this is also my experience. I have tried a lot of things and where quality is more important than quantity, I doubt there are many tools that can come close to Llamaparse.
All your examples are exquisitely clean digital renders of digital documents. How does it fare with real scans (noise, folds) or photos? Receipts?
Or is there a use case for digital non-text pdfs? Are people really generating image and not text-based PDFs? Or is the primary use case extracting structure, rather than text?
How well does llamaparse work on foreign-language documents?
I have pipeline for Arabic-language docs using Azure for OCR and GPT-4o-mini to extract structured information. Would it be worth trying llamaparse to replace part of the pipeline or the whole thing?
I've been using NotebookLM powered by Gemini 2.0 for three projects and it is _really powerful_ for comprehending large corpuses you can't possibly read and thinking informed by all your sources. It has solid Q&A. When you ask a question or get a summary you like [which often happens] you can save it as a new note, putting it into the corpus for analysis. In this way your conclusions snowball. Yes, this experience actually happens and it is beautiful.
I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works - you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big blog pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change I'm sure :)
My projects are setup to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think - which REALLY helps out.
They are:
* The Stratigrapher - A Lovecraftian short story about the world's first city.
  - All of Seton Lloyd/Faud Safar's work on Eridu.
  - Various sources on Sumerian culture and religion
  - All of Lovecraft's work and letters.
  - Various sources about opium
  - Some articles about nonlinear geometries
* FPGA Accelerated Graph Analytics
  - An introduction to Verilog
  - Papers on FPGAs and graph analytics
  - Papers on Apache Spark architecture
  - Papers on GraphFrames and a related rant I created about it and graph DBs
  - A source on Spark-RAPIDS
  - Papers on subgraph matching, graphlets, network motifs
  - Papers on random graph models
* Graph machine learning notebook without a specific goal, which has been less successful. It helps to have a goal for the project. It got confused by how broad my sources were.
I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)
thanks a ton for all the amazing feedback on this thread! if
(a) you have document understanding use cases that you'd like to use gemini for (the more aspirational the better) and/or
(b) there are loss cases for which gemini doesn't work well today,
please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working & improve quality for our next series of model updates!
We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.
Under the hood, Tika uses Tesseract for OCR parsing. For clarity, this all works surprisingly well generally speaking, and it's pretty easy to run yourself and an order of magnitude cheaper than most services out there.
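If anyone wants to try the same locally, a minimal sketch with the tika Python bindings (which start a local Tika server under the hood) looks roughly like this; Tesseract needs to be installed separately for scanned pages:

```python
# Minimal sketch: extract text + metadata from a PDF with Apache Tika via the
# `tika` Python bindings. Structure (tables, headers) is largely flattened.
from tika import parser

parsed = parser.from_file("report.pdf")
print(parsed["metadata"].get("Content-Type"))
print((parsed["content"] or "")[:500])   # content can be None for unparseable files
```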
In my mind, Gemini 2.0 changes everything because of the incredibly long context (2M tokens on some models), while having strong reasoning capabilities.
We are working on compliance solution (https://fx-lex.com) and RAG just doesn’t cut it for our use case. Legislation cannot be chunked if you want the model to reason well about it.
It’s magical to be able to just throw everything into the model. And the best thing is that we automatically benefit from future model improvements along all performance axes.
Gemini models run in the cloud, so there is no issue with hardware.
The EU regulations typically include delegated acts, technical standards, implementation standards and guidelines. With Gemini 2.0 we are able to just throw all of this into the model and have it figure things out.
This approach gives way better results than anything we are able to achieve with RAG.
My personal bet is that this is what the future will look like. RAG will remain relevant, but only for extremely large document corpuses.
We haven't tried that, we might do that in the future.
My intuition - not based on any research - is that recall should be a lot better from in context data vs. weights in the model. For our use case, precise recall is paramount.
Somewhat tangential, but the EU has a directive mandating electronic invoicing for public procurement.
One of the standards that has come out of that is EN 16931, also known as ZUGFeRD and Factur-X, which basically involves embedding an XML file with the invoice details inside a PDF/A. It allows the PDF to be used like a regular PDF but it also allows the government procurement platforms to reliably parse the contents without any kind of intelligence.
It seems like a nice solution that would solve a lot of issues with ingesting PDFs for accounting if everyone somehow managed to agree a standard. Maybe if EN 16931 becomes more broadly available it might start getting used in the private sector too.
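For PDFs that already carry the embedded XML, extraction is refreshingly boring; a rough sketch assuming a recent pypdf (which exposes embedded files via its attachments property) and the conventional attachment names:

```python
# Sketch: read the Factur-X / ZUGFeRD invoice XML embedded in a PDF/A instead
# of OCRing the rendered pages. Assumes a recent pypdf; attachment names vary
# ("factur-x.xml", "zugferd-invoice.xml", ...).
import xml.etree.ElementTree as ET
from pypdf import PdfReader

reader = PdfReader("invoice.pdf")
for name, payloads in reader.attachments.items():   # maps filename -> list of byte strings
    if name.lower().endswith(".xml"):
        root = ET.fromstring(payloads[0])
        print(name, root.tag)   # from here, read invoice fields directly from the XML
```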
> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes
Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.
>Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes
This is what I have found as well. From what I've read, LLMs do not work well with images for specific details due to image encoders which are too lossy. (No idea if this is actually correct.) For now I guess you can use regular OCR to get bounding boxes.
Modern multimodal encoders for LLMs are fine/not lossy since they do not resize to a small size and can handle arbitrary sizes, although some sizes are obviously better represented in the training set. A 8.5" x 11" paper would be common.
I suspect the issue is prompt engineering related.
> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
> - Values should be percentages of the image width and height (0 to 1)
LLMs have enough trouble with integers (since token-wise integers and text representation of integers are the same), high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers" then parse and calculate the decimals later.
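A small sketch of that reframing (prompt wording and page size are illustrative): ask for integer pixel boxes against a stated page size, then do the division yourself:

```python
# Sketch: request integer pixel boxes, convert to fractions of the page size afterwards.
PAGE_W, PAGE_H = 850, 1100   # illustrative page size in pixels

prompt = (
    f"The attached page image is {PAGE_W} px wide and {PAGE_H} px tall. "
    "Return a JSON list of objects with keys 'text' and 'box', where 'box' is "
    "[x_min, y_min, x_max, y_max] in integer pixels with a top-left origin."
)

def to_fractions(box):
    x_min, y_min, x_max, y_max = box
    return [x_min / PAGE_W, y_min / PAGE_H, x_max / PAGE_W, y_max / PAGE_H]

# e.g. to_fractions([85, 110, 425, 220]) -> [0.1, 0.1, 0.5, 0.2]
```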
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.
I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production is when accuracy requirements are high (> 97%). This is because OCR and parsing is only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.
This requires things like:
- state-of-the-art parsing powered by VLMs and OCR
- multi-step extraction powered by semantic chunking, bounding boxes, and citations
- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)
- tooling that lets nontechnical members quickly iterate, review results, and improve accuracy
- evaluation and benchmarking tools
- fine-tuning pipelines that turn reviewed corrections —> custom models
Very excited to test and benchmark Gemini 2.0 in our product, and very excited about the progress here.
> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.
I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.
Seems like you could solve hallucinations by repeating the task multiple times. Non-hallucinations will be the same. Hallucinations will be different. Discard and retry hallucinated sections. This increases cost by a fixed multiple, but if cost of tokens continues to fall that's probably perfectly fine.
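A rough sketch of that repeat-and-compare idea; extract_fields here is a placeholder for whatever LLM extraction call you already have, and field values are assumed to be simple (hashable) types:

```python
# Sketch: run the same extraction several times, keep only fields where every
# run agrees, flag the rest for retry or human review. extract_fields() is a
# placeholder for an existing LLM extraction call returning a flat dict.
from collections import Counter

def extract_with_consensus(document, extract_fields, runs=3):
    results = [extract_fields(document) for _ in range(runs)]
    consensus, flagged = {}, {}
    for key in results[0]:
        values = [r.get(key) for r in results]
        value, count = Counter(values).most_common(1)[0]
        if count == runs:          # unanimous -> unlikely to be a hallucination
            consensus[key] = value
        else:                      # any disagreement -> re-run or send to review
            flagged[key] = values
    return consensus, flagged
```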
I think professional services will continue to use OCRs in one way or another, because it's simply too cheap, fast, and accurate. Perhaps, multi-modal models can help address shortcomings of OCRs, like layout detection and guessing unrecognizable characters.
The numbers in the blog post seem VERY inaccurate.
Quick calculation:
Input pricing: Image input in 2.0 Flash is $0.0001935. Let's ignore the prompt.
Output pricing: Let's assume 500 tokens per page, which is $0.0003
Cost per page: $0.0004935
That means 2,026 pages per dollar. Not 6,000!
Might still be cheaper than many solutions but I don't see where these numbers are coming from.
By the way, image input is much more expensive in Gemini 2.0 even for 2.0 Flash Lite.
Edit: The post says batch pricing, which would be 4k pages based on my calculation. Using batch pricing is pretty different though. Great if feasible but not practical in many contexts.
Correct, it's with batch Vertex pricing and slightly lower output tokens per page, since a lot of pages are somewhat empty in real-world docs - I wanted a fair comparison to providers that charge per page.
Regardless of what assumptions you use, it's still an order-of-magnitude-plus improvement over anything else.
I've not followed the literature very closely for some time - what problem are they trying to solve in the first place? They write "for documents to be effectively used in RAG pipelines, they must be split into smaller, semantically meaningful chunks". Segmenting each page by paragraphs doesn't seem like a particularly hard vision problem, nor do I see why an OCR system would need to incorporate an LLM (which seem more like a demonstration of overfitting than a "language model" in any literal sense, going by ChatGPT). Perhaps I'm just out of the loop.
Finally, I must point out that statements in the vein of "Why [product] 2.0 Changes Everything" are more often than not a load of humbug.
Great article, I couldn't find any details about the prompt... only the snippets of the `CHUNKING_PROMPT` and the `GET_NODE_BOUNDING_BOXES_PROMPT`.
Is there any code example with a full prompt available from OP, or are there any references (such as similar GitHub repos) for those looking to get started with this topic?
I think it is very ironic that we chose to use PDF in many fields to archive data because it is a standard and because we would be able to open our pdf documents in 50 or 100 years time. So here we are just a couple of years later facing the challenge of getting the data out of our stupid PDF documents already!
It's not ironic. PDFs are a container, which can hold scanned documents as well as text. Scanned documents need OCR and to be analyzed for their layout. This is not a failing of the PDF format, but a problem inherent to working with print scans.
I don't claim PDF is a good format. It is inscrutable to me.
I work in the healthcare domain. We've had great success converting printed lab reports (95%) to JSON format using the 1.5 Flash model. This post is really exciting for me; I will definitely try out the 2.0 models.
The struggle which almost every OCR use case faces is with handwritten documents (doctor prescriptions with bad handwriting). With Gemini 1.5 Flash we've had ~75-80% accuracy (based on random sampling by pharmacists). We're planning to improve this further by fine-tuning Gemini models with medical data.
What could be other alternative services/models for accurate handwriting ocr?
I'm guessing that human accuracy may be lower or around that value, given that handwritten notes are generally difficult to read. A better metric for document parsing might be accuracy relative to human performance (how much better the LLM performs compared to a human).
Hrm, I've been using a combo of Textract (for bounding boxes) and AI for understanding the contents of the document. Textract is excellent at bounding boxes and exact-text capture, but LLMs are excellent at understanding when a messy/ugly bit of a form is actually one question, or if there are duplicate questions, etc.
Correlating the two outputs (Textract <-> AI) is difficult, but another round of AI is usually good at that. Combined with some text-diff scoring and logic, I can get pretty good full-document understanding of questions and answer locations. I've spent a pretty absurd amount of time on this and as of yet have not launched a product with it, but if anyone is interested I'd love to chat about the pipeline!
Been toying with the Flash model. Not the top model, but I think it'll see plenty of use due to the details. It wins on things other than the top of the benchmark logs:
* Generous free tier
* Huge context window
* Lite version feels basically instant
However
* Lite model seems more prone to repeating itself / looping
* Very confusing naming, e.g. {model}-latest worked for 1.5 but now it's {model}-001? The Lite version has a date appended, the non-Lite does not. Then there is exp and thinking exp... which has a date. wut?
But how well does it actually handle that context window? E.g. a lot of models support 200K context, but the LLM can only really work with ~80K or so of it before it starts to get confused.
It works REALLY well. I have used it to dump in many reference code files and then help me write new modules, etc. I have gone up to 200k tokens I think with no problems in recall.
There is the needle in the haystack measure which is, as you probably guessed, hiding a small fact in a massive set of tokens and asking it to recall it.
Recent Gemini models actually do extraordinarily well.
It works okay out to roughly 20-40k tokens. Once the window gets larger than that, it degrades significantly. You can needle in the haystack out to that distance, but asking it for multiple things from the document leads to hallucinations for me.
Ironic, but GPT-4o works better for me at longer contexts <128k than Gemini 2.0 Flash. And out to 1M it's just hopeless, even though you can do it.
Ingesting PDFs accurately is a noble goal which will no doubt be solved as LLMs get better. However, I need to point out that the financial statement example used in the article already has a solution: iXBRL.
Many financial regulators require you to publish heavily marked up statements with iXBRL. These markups reveal nuances in the numbers that OCRing a post processed table will not understand.
Of course, financial documents are a narrow subset of the problem.
Maybe the problem is with PDF as a format: Unfortunately PDFs lose that meta information when they are built from source documents.
I can't help but feel that PDFs could probably be more portable as their acronym indicates.
Glad Gemini is getting some attention. Using it is like a superpower. There are so many discussions about ChatGPT, Claude, DeepSeek, Llama, etc. that don't even mention Gemini.
Before 2.0 models their offerings were pretty underwhelming, but now they can certainly hold their own. I think Gemini will ultimately be the LLM that eats the world, Google has the talent and most importantly has their own custom hardware (hence why their prices are dirt cheap and context is huge).
Google had a pretty rough start compared to ChatGPT, Claude. I suspect that left a bad taste in many people's mouths. In particular because evaluating so many LLM's is a lot of effort on its own.
Llama and DeepSeek are no-brainers; the weights are public.
Google was not serious about LLMs, they could not even figure what to call it. There is always a risk that they will get bored and just kill the whole thing.
I tried using Gemini 2.0 Flash for PDF-to-Markdown parsing of scientific papers after having good results with GPT-4o, but the experience was terrible.
When I sent images of PDF pages with extracted text, Gemini mixed headlines with body text, parsed tables incorrectly, and sometimes split tables, placing one part at the top of the page and the rest at the bottom. It also added random numbers (like inserting an "8" for no reason).
When using the Gemini SDK to process full PDFs, Gemini 1.5 could handle them, but Gemini 2.0 only processed the first page. Worse, both versions completely ignored tables.
Among the Gemini models, 1.5 Pro performed the best, reaching about 80% of GPT-4o’s accuracy with image parsing, but it still introduced numerous small errors.
In conclusion, no Gemini model is reliable for PDF-to-Markdown parsing and beyond the hype - I still need to use GPT-4o.
Docling has worked well for me. It handles scenarios that crashed ChatGPT Pro. Only problem is it's super annoying to install. When I have a minute I might package it for homebrew.
If it's superior (esp. for scans with text flowing around image boxes), and if you do end up packaging it up for brew, know that there's at least one developer who will benefit from your work (for a side-project, but that goes without saying).
I have seen no decent program that can read, OCR, analyze, and tabulate data correctly from very large PDF files with a lot of scanned information from different sources. I run my practice with PDF files, one for each patient. It is a treasure trove of actionable data. PDF filing in this manner allows me to finish my daily tasks in 4 hrs instead of 12 hrs! For sick patients who need information at the point of care, PDF has numerous advantages over the usual hospital EHR portals, etc. If any smart engineers are interested in working with me, please connect with me.
I can help as can many others. Probably a good place to start though is with some of the more recent off the shelf solutions like trellis (I have no affiliation with them).
One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes on photography, but identifying sections of text or digital graphics like icons in a presentation is still very hit and miss.
Have you seen any models that perform better at this? I last looked into this a year ago but at the time they were indeed quite bad at it across the board.
What would change "everything" is if we managed to switch to "real" digital parseable formats instead of this dead tree emulation that buries all data before the arrival of AI...
Most PDF parsers give you coordinate data (bounding boxes) for extracted text. Use these to draw highlights over your PDF viewer - users can then click the highlights to verify if the extraction was correct.
The tricky part is maintaining a mapping between your LLM extractions and these coordinates.
One way to do it would be with two LLM passes:
1. First pass: Extract all important information from the PDF
2. Second pass: "Hey LLM, find where each extraction appears in these bounded text chunks"
Not the cheapest approach since you're hitting the API twice, but it's straightforward!
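As a variation on that second pass, the matching can also be done locally instead of with another LLM call, by fuzzy-matching each extraction against the word boxes the PDF parser already returns; a rough word-level sketch with pdfplumber:

```python
# Sketch: map an extracted value back to coordinates by fuzzy-matching it
# against pdfplumber's word boxes. Word-level only for brevity; multi-word
# extractions would need matching over spans of consecutive words.
import difflib
import pdfplumber

def locate(extraction: str, pdf_path: str):
    """Return (page_number, (x0, top, x1, bottom)) of the best-matching word."""
    best_score, best_page, best_word = 0.0, None, None
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for word in page.extract_words():
                score = difflib.SequenceMatcher(None, extraction, word["text"]).ratio()
                if score > best_score:
                    best_score, best_page, best_word = score, page_no, word
    if best_word is None:
        return None
    return best_page, (best_word["x0"], best_word["top"], best_word["x1"], best_word["bottom"])
```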
I’ve been very reluctant to use closed source LLMs. This might actually convince me to use one. I’ve done so many attempts at pdf parsing over the years. It’s awful to deal with. 2 column format omg. Most don’t realize that pdfs contain instructions for displaying the document and the content is buried in there. It’s just always been a problematic format.
Two years ago, I worked for a company that had its own proprietary AI system for processing PDFs. While the system handled document ingestion, its real value was in extracting and analyzing data to provide various insights. However, one key requirement was rendering documents in HTML with as close to a 1:1 likeness as possible.
At the time, I evaluated multiple SDKs for both OCR and non-OCR PDF conversions, but none matched the accuracy of Adobe Acrobat’s built-in solution. In fact, at one point (don’t laugh), the company resorted to running Adobe Acrobat on a Windows machine with automation tools to handle the conversion. Using Adobe’s cloud service for conversion was not an option due to the proprietary nature of the PDFs. Additionally, its results were inconsistent and often worse compared to the desktop version of Adobe Acrobat!
Given that experience, I see this primarily as an HTML/text conversion challenge. If Gemini 2.0 truly improves upon existing solutions, it would be interesting to see a direct comparison against popular proprietary tools in terms of accuracy.
We started with using LLMs for parsing at Tensorlake (https://docs.tensorlake.ai), tried Qwen, Gemini, OpenAI, pretty much everything under the sun. My thought was we could skip 5-6 years of development IDP companies have done on specialized models by going to LLMs.
On information-dense pages, LLMs often hallucinate half of the time, they have trouble understanding empty cells in tables, don't understand checkboxes, etc.
We had to invest heavily in building a state-of-the-art layout understanding model and finally a table structure understanding model for reliability. LLMs will get there, but there is some way to go.
Where they do well is in VQA type use cases, ask a question, very narrowly scoped, they will work much better than OCR+Layout models, because they are much more generalizable and flexible to use.
Good post. VLM models are improving and Gemini 2.0 definitely changes the doc prep and ingestion pipeline across the board.
What we're finding as we work with enterprise customers:
1. Attribution is super important, and VLMs aren't there yet. Combining them with layout analysis makes for a winning combo.
2. VLMs are great at prompt-based extraction, but if you have document automation and you don't know where in tables you'll be searching, or you need to reproduce them faithfully, then precise table extraction is important.
3. VLMs will continue to get better, but the price points are a result of economies of scale that document parsing vendors don't get. On the flip side, document parsing vendors have deployment models that Gemini can't reach.
Shameless plug: I'm working on a startup in this space.
But the bounding box problem hits close to home. We've found Unstructured's API gives pretty accurate box coordinates, and with some tweaks you can make them even better. The tricky part is implementing those tweaks without burning a hole in your wallet.
Hmm, I have been doing a bit of this manually lately for a personal project. I am working on some old books that are far past any copyright, but they are not available anywhere on the net. (Being in Norwegian makes a book a lot more obscure.) So I have been working on creating ebooks out of them.

I have a scanner, and some OCR processes I run things through. I am close to 85% from my automatic process. The pain of going from 85% to 99%, though, is considerable (and in my case manual) (well, Perl helps).

I went to try this AI on one of the short poem manuscripts I have. I told the prompt I wanted PDF to Markdown; it says sure, go ahead, give me the PDF. I went to upload it. It spent a long time spinning, then a quick message comes up, something like "Failed to count tokens", but it just flashes and goes away.

I guess the PDF is too big? Weird though, it's not a lot of pages.
I experienced something similar. My use case is I need to summarize bank statements (sums, averages, etc.). Gemini wouldn't do it; it said too many pages. When I asked the max number of supported pages, it said the max is 14 pages. Attempted on both 2.0 Flash and 2.0 Pro in the Vertex AI console.
Try with https://aistudio.google.com
Think the page limit is a vertex thing
The only limit in reality is the number of input tokens taken to parse the pdf.
If those tokens + tokens for the rest of your prompt are under the context window limit, you're good.
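If you want to check before sending, the API exposes a token counter; a small sketch with the google-genai SDK (model name illustrative):

```python
# Sketch: count the tokens a PDF will consume before running the real request,
# then compare against the model's context window (~1M tokens for 2.0 Flash).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
with open("book.pdf", "rb") as f:
    pdf_part = types.Part.from_bytes(data=f.read(), mime_type="application/pdf")

count = client.models.count_tokens(
    model="gemini-2.0-flash",
    contents=[pdf_part, "Convert this PDF to Markdown."],
)
print(count.total_tokens)
```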
This is completely tangential, but does anyone know if AI is creating any new jobs?
Thinking of the OCR vendors who get replaced. Where might they go?
One thing I can think of is that AI could help the space industry take off. But wondering if there are any concrete examples of new jobs being created.
I've built a simple OCR tool with Gemini 2.0 Flash with several options:
1-Simple OCR: Extracts all detected text from uploaded files
2-Advanced OCR: Enables rule-based extraction (e.g., table data)
3-Bulk OCR: Designed for processing multiple files at once
The project will be open-source next week.
You can try the tool here: https://gemini2flashocr.netlify.app
I think very soon a new model will destroy whatever startups and services are built around document ingestion. As in, a model that can take in a PDF page as an image and transcribe it to text with near-perfect accuracy.
Extracting plain text isn’t that much of a problem, relatively speaking. It’s interpreting more complex elements like nested lists, tables, side bars, footnotes/endnotes, cross-references, images and diagrams where things get challenging.
I think the Azure Document Intelligence, Google Document AI and Amazon Textract are among the best if not the best services though and they offer these models.
I have not tested Azure Document Intelligence or Google Document AI, but AWS Textract, Llamaparse, Unstructured and Omni made it to my shortlist. I have not tested Docling, as I could not install it on my Windows laptop.
They do not test Llamaparse in the accuracy benchmark. In my personal experience Llamaparse was one of the rare tools that always got the right information. Also, the accuracy is only based on tables, and we had issues with irregular text structures as well. It is also worth noting that using an LLM, a non-deterministic tool, to do something deterministic is a bit risky, and you need to write, modify and maintain a prompt.
Gemini Flash 2.0 is impressive, but it hardly captures all of the information in the PDF. It's great for getting vibes from the document or finding overall information in it. If you ask it to e.g. enumerate every line item from multiple tables in a long PDF, it still falls flat (dropping some line items or entire sections, etc.). DocuPanda and, to a lesser extent, Unstructured handle this.
I wish more PDFs were generated as hybrid PDFs. These are PDFs that also include their original source material. Then you have a document whose format is fixed, but if you need more semantic information, there it is!
I wonder how this compares to open source models (which might be less accurate but even cheaper if self-hosted?), e.g. Llama 3.2. I'll see if I can run the benchmark.
Also regarding the failure case in the footnote, I think Gemini actually got that right (or at least outperformed Reducto) - the original document seems to have what I call a "3D" table where the third axis is rows within each cell, and having multiple headers is probably the best approximation in Markdown.
Everything I tried previously had very disappointing results. I was trying to get rid of Azure's DocumentIntelligence, which is kind of expensive at scale. The models could often output a portion of a table, but it was nearly impossible to get them to produce a structured output of a large table on a single page; they'd often insert "...rest of table follows" and similar terminations, regardless of different kinds of prompting.
Maybe incremental processing of chunks of the table would have worked, with subsequent stitching, but if Gemini can just process it that would be pretty good.
I'm failing to understand the ingesting part of Gemini 2.0. Does Gemini provide a dedicated PDF-to-Markdown conversion API, or do the LLM APIs handle it with a prompt like "Extract the attached PDF" using this API: https://ai.google.dev/gemini-api/docs/document-processing?la...
Orthogonal to this post, but this just highlights the need for a more machine readable PDF alternative.
I get the inertia of the whole world being on PDF. And perhaps we can just eat the cost and let LLMs suffer the burden going forwards. But why not use that LLM coding brain power to create a better overall format?
I mean, do we really see printing things out onto paper as something we need to worry about for the next 100 years? It reminds me of the TTY interface at the heart of Linux. There was a time it all made sense, but can we just deprecate it all now?
PDF does support incorporating information about the logical document structure, aka Tagged PDF. It’s optional, but recommended for accessibility (e.g. PDF/UA). See chapters 14.7–14.8 in [1]. Processing PDF files as rendered images, as suggested elsewhere in this thread, can actually dramatically lose information present in the PDF.
Alternatively, XML document formats and the like do exist. Indeed, HTML was supposed to be a document format. That’s not the problem. The problem is having people and systems actually author documents in that way in an unambiguous fashion, and having a uniform visual presentation for it that would be durable in the long term (decades at least).
PDF as a format persists because it supports virtually every feature under the sun (if authors care to use them), while largely guaranteeing a precisely defined visual presentation, and being one of the most stable formats.
I'm not suggesting we re-invent RDF or any other kind of semantic web idea. And the fact that semantic data can be stored in a PDF isn't really the problem being solved by tools such as these. In many cases, PDF is used for things like scanned documents where adding that kind of metadata can't really be done manually - in fact the kinds of tools suggested in the post would be useful for adding that metadata to the PDF after scanning (for example).
Imagine you went to a government office looking for some document from 1930s, like an ancestors marriage or death certificate. You might want to digitize a facsimile of that using a camera or a scanner. You have a lot of options to store that, JPG, PNG, PDF. You have even more options to store the metadata (XML, RDF, TXT, SQLite, etc.). You could even get fancy and zip up an HTML doc alongside a directory of images/resources that stitched them all together. But there isn't really a good standard format to do that.
It is the second part of your post that stands out - the kitchen-sink nature of PDFs is what makes them so terrible. If they were just wrappers for image data, formatted in a way that made printing them easy, I probably wouldn't dislike them.
Strange that LlamaParse is mentioned in the pricing table but not the results. We’ve used them to process a lot of pages and it’s been excellent each time.
You can recover word-level bounding boxes and confidence scores by using a traditional OCR engine such as AWS Textract and matching the results to Gemini’s output – see https://docless.app for a demo (disclaimer: I am the founder)
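For illustration, here is a rough Python sketch of that matching idea using only the standard library's difflib. This is not the docless.app implementation, and the ocr_words format is made up for the example:

```python
# Sketch: align words from a traditional OCR engine (which provides
# bounding boxes + confidence) with Gemini's transcription, so the
# LLM text inherits per-word boxes. Purely illustrative; production
# matching is more involved.
from difflib import SequenceMatcher

def attach_boxes(ocr_words, gemini_text):
    """ocr_words: list of dicts like {"text": "Total", "box": (x0, y0, x1, y1), "conf": 0.98}"""
    gemini_words = gemini_text.split()
    matcher = SequenceMatcher(
        a=[w["text"] for w in ocr_words], b=gemini_words, autojunk=False
    )
    aligned = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            for ocr_w, gem_w in zip(ocr_words[a0:a1], gemini_words[b0:b1]):
                aligned.append({"word": gem_w, "box": ocr_w["box"], "conf": ocr_w["conf"]})
        else:
            # The two sources disagree here: no box, flag for review.
            aligned.extend({"word": w, "box": None, "conf": None}
                           for w in gemini_words[b0:b1])
    return aligned
```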
If this is vendor work, you should probably hire people who are competitive in the software engineering space. And do we actually need a significant amount of processing as a solution? If so, commonly used public PDFs that have already been converted to Markdown should be open-sourced; we shouldn't repeat each other's work.
If the end goal is just RAG or search over the PDFs, ColPali-based embedding search seems like a good alternative here. Don't process the PDFs; instead, just search their image embeddings directly. From what I understand, you also get a sort of attention map showing which part of the image is being activated by the search.
Has anyone in the AEC industry who's reading this worked out a good way to get Bluebeam MEP and electrical layouts into Revit (LOD 200-300)?
I've seen MarkupX as a paid option, but it seems some AI in the loop could greatly speed up exception handling, encoding family placement at certain elevations based on building code docs, and so on.
Curious to see how well this works on technical/mechanical documentation (manuals, parts lists, etc.). Has anyone tried? My company Airwave had to jump through all sorts of hoops to get accurate information for our use case: getting accurate info to the technicians in the field.
ritvik here from pulse. everyone's pretty much made the right points here, but wanted to emphasize that due to the LLM architecture, they predict "the most probable text string" that corresponds to the embedding, not necessarily the exact text. this non-determinism is awful for customers deploying in production, and a lot of our customers complained about it to us initially. the best approach is to build a sort-of "agent"-based VLM x traditional layout segmentation/reading-order algos, which is what we've done and are continuing to do.
we have a technical blog on this exact phenomenon coming out in the next couple days, will attach it here when it's out!
I'm building a system that does regular OCR and outputs layout-following ASCII; in my admittedly limited tests it works better than most existing offerings.
It will be ready for beta testing this week or the next, and I will be looking for beta testers; if interested please contact me!
I think this is one of the few functional applications of LLMs that is really undeniably useful.
OCR has always been “untrustworthy” (as in you cannot expect it to be 100% correct and know you must account for that) and we have long used ML algorithms for the process.
OCR is not to blame; when you have garbage in, you should not expect anything of high quality out, especially with handwriting, tables and different languages. Even human beings fail to understand some documents (see doctors' prescriptions).
The article mentions OCR, but you're sending a PDF; how is that OCR? Or is this a mistake? What if you send photos of the pages (that would be true OCR) - does the performance and price remain the same?
Anyone know if there are uses of this with PHI? Most doctors still fax reports to each other and this would help a lot to drop the load on staff when receiving and categorizing/assigning to patients
Gemini is amazing, but I get this copyright error for some documents and I have a rate limit of just 10 requests per minute. Same issue with Claude, except the copyright error is called a content warning.
> accuracy is measured with the Needleman-Wunsch algorithm
> Crucially, we’ve seen very few instances where specific numerical values are actually misread. This suggests that most of Gemini’s “errors” are superficial formatting choices rather than substantive inaccuracies. We attach examples of these failure cases below [1].
> Beyond table parsing, Gemini consistently delivers near-perfect accuracy across all other facets of PDF-to-markdown conversion.
That seems fairly useful to me, no? Maybe not for mission critical applications, but for a lot of use cases, this seems to be good enough. I'm excited to try these prompts on my own later.
This is "good enough" for Banks to use when doing due diligence. You'd be surprised how much noise is in the system with the current state of the art: algorithms/web scrapers and entire buildings of humans in places like India.
Author here — measuring accuracy in table parsing is surprisingly challenging. Subtle, almost imperceptible differences in how a table is parsed may not affect the reader's understanding but can significantly impact benchmark performance. For all practical purposes, I'd say it's near perfect (also keep in mind the benchmark is on very challenging tables).
There’s AWS Bedrock Knowledge Base (Amazon's proprietary RAG solution), which can digest PDFs and, as far as I've tested it on real-world documents, works pretty well and is cost effective.
I've been working on something similar the past couple months. A few thoughts:
- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism for the best accuracy (see the sketch after this list).
- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Outputs are also much worse when you use too much context.
- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
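On the first point, a minimal sketch of the sliding-window idea over pages; the window sizes are arbitrary and the downstream model call is left as a comment:

```python
# Sketch: overlapping page windows so chunks that span a page break are
# seen whole in at least one window. The model call itself is elided.
def sliding_windows(pages, window=4, overlap=1):
    step = window - overlap
    for start in range(0, max(len(pages) - overlap, 1), step):
        yield pages[start:start + window]

pages = [f"page {i} text..." for i in range(1, 11)]
for chunk_of_pages in sliding_windows(pages):
    joined = "\n\n".join(chunk_of_pages)
    # send `joined` to the model here, then merge/dedupe results from
    # the overlapping pages downstream
    print(len(chunk_of_pages), "pages in this window")
```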
How is it for image recognition/classification? OCR can be a huge chunk of the image classification pipeline. Presumably, it works just as well in this domain?
probably a mix of economies of scale (google workspace and search are already massive customers of these models meaning the build out is already there), and some efficiency dividends from hardware r&d (google has developed the model and the TPU hardware purpose built to run it almost in parallel)
Traditional OCRs are trained for a single task: recognize characters. They do this through visual features (and sometimes there's an implicit (or even explicit) "language" model: see https://arxiv.org/abs/1805.09441). As such, the extent of their "hallucination", or errors, is when there's ambiguity in characters, e.g. 0 vs O (that's where the implicit language model comes in). Because they're trained with a singular purpose, you would expect their confidence scores (i.e. logprobs) to be well calibrated. Also, depending on the OCR model, you usually do a text detection (get bounding boxes) followed by a text recognition (read the characters), and so it's fairly local (you're only dealing with a small crop).
On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also a dozen of other tasks. As such, they're really good OCR models, but they tend to be not as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find it hallucinates that often, but we're not dealing with long documents. I would assume that as you're dealing with a larger set of documents, you have a much larger context, which increases the chances of the model getting confused and hallucinating.
We have been building smaller and more efficient VLMs for document extraction for a while now, and we are 10x faster than Unstructured and Reducto (the OCR vendors) with an accuracy of 90%.
P.S. - You can find us here (unsiloed-ai.com) or you can reach out to me on adnan.abbas@unsiloed-ai.com
I think they meant relative to the best other approach, which is Reducto’s given that they are the creators of the benchmark:
Reducto's own model currently outperforms Gemini Flash 2.0 on this benchmark (0.90 vs 0.84). However, as we review the lower-performing examples, most discrepancies turn out to be minor structural variations that would not materially affect an LLM’s understanding of the table.
For data extraction from long documents (100k+ tokens) how does structured outputs via providing a json schema compare vs asking one question per field (in natural language)?
Also I've been hearing good things regarding document retrieval about Gemini 1.5 Pro, 2.0 Flash and gemini-exp-1206 (the new 2.0 Pro?), which is the best Gemini model for data extraction from 100k tokens?
How do they compare against Claude Sonnet 3.5 or the OpenAI models, has anyone done any real world tests?
The write-up and ensuing conversation are really exciting. I think out of everything mentioned here - the clear stand-out point is that document layout analysis (DLA) is the crux of the issue for building practical doc ingestion for RAG.
(Note: DLA is the process of identifying and bounding specific segments of a document - like section headers, tables, formulas, footnotes, captions, etc.)
Strap in - this is going to be a longy.
We see a lot of people and products basically sending complete pages to LVLMs for converting to a machine-readable format, and for chunking. We tried this + it’s a possible configuration on chunkr as well. It has never worked for our customers, or during extensive internal testing across documents from a variety of verticals. Here are SOME of the common problems:
- Most documents are dense. The model will not OCR everything and will miss crucial parts.
- A bunch of hallucinated content that's tough to catch.
- Occasionally it will just refuse to give you anything. We’ve tried a bunch of different prompting techniques and the models return “<image>” or “||..|..” for an ENTIRE PAGE of content.
Despite this - it’s obvious that these ginormous neural nets are great for complex conversions like tables and formulas to HTML/Markdown & LaTeX. They also work great for describing images and converting charts to tables. But that’s the thing - they can only do this if you can pull out these document features individually as cropped images and have the model focus on small snippets of the document rather than the full page.
If you want knobs for speed, quality, and cost, the best approach is to work at a segment level rather than a page level. This is where DLA really shines - the downstream processing options are vast and can be fit to specific needs. You can choose what to process with simple + fast OCR (text-only segments like headers, paragraphs, captions), and what to send to a large model like Gemini (complex segments like tables, formulas, and images) - all while getting juicy bounding boxes for mapping citations. Combine this with solid reading order algos - and you get amazing layout-aware chunking that takes ~10ms.
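To make the routing idea concrete, here's a toy Python sketch. The segment dicts and both "processors" are stand-ins for illustration, not Chunkr's actual API:

```python
# Toy sketch of segment-level routing after document layout analysis.
# Segment format and both processing functions are placeholders.
FAST_OCR_TYPES = {"header", "paragraph", "caption", "footnote"}

def fast_ocr(segment):          # placeholder for a lightweight OCR engine
    return f"[ocr text for {segment['type']}]"

def vlm_convert(segment):       # placeholder for a Gemini/VLM call on the crop
    return f"[html/markdown for {segment['type']}]"

def process_segments(segments):
    chunks = []
    for seg in segments:        # segments come from the DLA model, in reading order
        text = fast_ocr(seg) if seg["type"] in FAST_OCR_TYPES else vlm_convert(seg)
        chunks.append({"text": text, "bbox": seg["bbox"], "type": seg["type"]})
    return chunks

print(process_segments([
    {"type": "header", "bbox": (0, 0, 600, 40)},
    {"type": "table", "bbox": (0, 60, 600, 400)},
]))
```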
We made RAG apps ourselves and attempted to index all ~600 million pages of open-access research papers for https://lumina.sh. This is why we built Chunkr - and it needed to be Open Source. You can self-host our solution and process 4 pages per second, scaling up to 11 million pages per month on a single RTX 4090, renting this hardware on Runpod costs just $249/month ($0.34/hour).
A VLM to do DLA sounds awesome. We've played around with this idea but found that VLMs don't come close to models where the architecture is solely geared toward these specific object detection tasks. While it would simplify the pipeline, VLMs are significantly slower and more resource-hungry - they can't match the speed we achieve on consumer hardware with dedicated models. Nevertheless, the numerous advances in the field are very exciting - big if true!
A note on costs:
There are some discrepancies between the API pricing of providers listed in this thread. Assuming 100000 pages + feature parity:
Chunkr API - 200 pages for $1, not 100 pages
AWS Textract - 40 pages for $1, not 1000 pages (No VLMs)
Llama Parse - 13 pages for $1, not 300
A note on RD-Bench:
We’ve been using Gemini 1.5 Pro for tables and other complex segments for a while, so the RD-bench is very outdated. We ran it again on a few hundred samples and got a 0.81 (also includes some notes on the bench itself). To the OP: it would be awesome if you could update your blog post!
Anyone who cries “<service> is dead” after some new technology is introduced is someone you can safely ignore. For ever. They’re hyperbolic clout chasers who will only ever be right by mistake.
As if, when ChatGPT was introduced, Google would just stay still, cross their arms, and say “well, this is based on our research paper but there’s nothing we can do, going to just roll over and wait for billions of dollars to run out, we’re truly doomed”. So unbelievably stupid.
Clickbait. It doesn't change "everything". It makes ingestion for RAG much less expensive (and therefore feasible in a lot more scenarios), at the expense of ~7% reduction in accuracy. Accuracy is already rather poor even before this, however, with the top alternative clocking in at 0.9. Gemini 2.0 is 0.84, although the author seems to suggest that the failure modes are mostly around formatting rather than e.g. mis-recognition or hallucinations.
TL;DR: is this exciting? If you do RAG, yes. Does it "change everything" nope. There's still a very long way to go. Protip for model designers: accuracy is always in greater demand than performance. A slow model that solves the problem is invariably better than a fast one that fucks everything up.
And people always have a hard time understanding what a certain degree of accuracy actually means. E.g. when you hear that a speech recognition system has 95% accuracy (5% WER), it means that it gets every 19th word wrong. That's abysmally bad by human standards - errors in every other sentence. That does not mean it's useless, but you do need to understand very clearly what you're dealing with, and what those errors might do to the rest of your system.
Now, I could look at this relatively popular post about Google and revise my opinion of HN as an echo chamber, but I’m afraid it’s just that the downvote loving HNers weren’t able to make the cognitive leap from Gemini to Google.
We’ve generally found that Gemini 2.0 is a great model and have tested this (and nearly every VLM) very extensively.
A big part of our research focus is incorporating the best of what new VLMs offer without losing the benefits and reliability of traditional CV models. A simple example of this is we’ve found bounding box based attribution to be a non-negotiable for many of our current customers. Citing the specific region in a document where an answer came from becomes (in our opinion) even MORE important when using large vision models in the loop, as there is a continued risk of hallucination.
Whether that matters in your product is ultimately use case dependent, but the more important challenge for us has been reliability in outputs. RD-TableBench currently uses a single table image on a page, but when testing with real world dense pages we find that VLMs deviate more. Sometimes that involves minor edits (summarizing a sentence but preserving meaning), but sometimes it’s a more serious case such as hallucinating large sets of content.
The more extreme case is that internally we fine tuned a version of Gemini 1.5 along with base Gemini 2.0, specifically for checkbox extraction. We found that even with a broad distribution of checkbox data we couldn’t prevent frequent checkbox hallucination on both the flash (+17% error rate) and pro model (+8% error rate). Our customers in industries like healthcare expect us to get it right, out of the box, deterministically, and our team’s directive is to get as close as we can to that ideal state.
We think that the ideal state involves a combination of the two. The flexibility that VLMs provide, for example with cases like handwriting, is what I think will make it possible to go from 80 or 90 percent accuracy to some number very close to 99%. I should note that the Reducto performance for table extraction is with our pre-VLM table parsing pipeline, and we’ll have more to share in terms of updates there soon.
For now, our focus is entirely on the performance frontier (though we do scale costs down with volume). In the longer term as inference becomes more efficient we want to move the needle on cost as well.
Overall though, I’m very excited about the progress here.
---
One small comment on your footnote: the evaluation script with the Needleman-Wunsch algorithm doesn’t actually consider the headers outputted by the models and looks only at the table structure itself.
Great, I landed on the reasoning and citations bit through trial and error and the outputs improved for sure.
How did you add bounding boxes, especially if it is a variety of files?
So why should I still use Extend instead of Gemini?
How do you handle the privacy of the scanned documents?
> After trial and error with different models
As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in gemini for half a year, and every single week the results were slightly different.
To note, the docs are bilingual, which could affect the results, but what struck me is the lack of consistency; even with the same model, running it two or three times in a row gives different results.
That's fine for my usage, but it sounds like a nightmare if, every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.
And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
Consider turning down the temperature in the configuration? LLMs have a bit of randomness in them.
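For example, with the google-generativeai Python SDK you can pin the generation config; a sketch (note that even at temperature 0, some run-to-run variation can remain):

```python
# Sketch: request deterministic-ish decoding by setting temperature to 0.
# Assumes the google-generativeai Python SDK; file and model names are
# illustrative. Greedy decoding reduces, but may not fully eliminate,
# run-to-run variation.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel(
    "gemini-2.0-flash",
    generation_config={"temperature": 0.0, "top_p": 1.0},
)
response = model.generate_content(
    ["OCR this PDF into Markdown.", genai.upload_file("doc.pdf")]
)
print(response.text)
```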
Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/
> and every single week the results were slightly different.
This is one of the reasons why open source offline models will always be part of the solution, if not the whole solution.
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
That’s why you have the Azure OpenAI APIs, which give a lot more consistency.
Wait, isn't there at least a two-step process here: semantic segmentation followed by a method like Textract for the text, to avoid hallucinations?
One cannot possibly claim that text extracted by a multimodal model cannot be hallucinated?
> accuracy was like 96% of that of the vendor and price was significantly cheaper.
I would like to know how this 96% was tested. If you use a human to do random-sample-based testing, how do you adjust the random sample for variations in the distribution of errors? For example, a small set of documents could contain 90% of the errors while being only 1% of the docs.
One thing people always forget about traditional OCR providers (azure, tesseract, aws textract, etc.) is that they're ~85% accurate.
They are all probabilistic. You literally get back characters + confidence intervals. So when textract gives you back incorrect characters, is that a hallucination?
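For reference, this is roughly what that looks like with Textract via boto3; a sketch, with an arbitrary 90% review threshold:

```python
# Sketch: traditional OCR engines return per-item confidence scores.
# Here, low-confidence lines from AWS Textract are flagged for review.
import boto3

client = boto3.client("textract")
with open("scan.png", "rb") as f:
    result = client.detect_document_text(Document={"Bytes": f.read()})

for block in result["Blocks"]:
    if block["BlockType"] == "LINE":
        flag = "  <-- review" if block["Confidence"] < 90 else ""
        print(f'{block["Confidence"]:5.1f}  {block["Text"]}{flag}')
```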
For an OCR company I imagine it is unconscionable to do this, because if you were, say, OCR'ing for an oral history project for a library and you made hallucination errors, you'd have replaced facts with fiction. Rewriting history? What the actual F.
Can confirm: using Gemini, some figure numbers were hallucinated. I had to cross-check each row to make sure the extracted data was correct.
Wouldn’t the temperature on something like OCR be very low? You want the same result every time. Isn’t some part of hallucination the randomness of temperature?
The LLMs are near perfect (maybe parsing I instead of 1). If you're using the outputs in the context of RAG, your errors are likely much, much higher in the other parts of your system. Spending a ton of time and money chasing 9's when 99% of your system's errors have totally different root causes seems like a bad use of time (unless they're not).
This sounds extremely like my old tax accounting job. OCR existed and "worked" but it was faster to just enter the numbers manually than fix all the errors.
Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.
Germany (not exactly the cradle of digitalization) already auto-fills salary tax fields with data from the employer.
They finally made filing free.
So, maybe this century?
This is a big aha moment for me.
If Gemini can do semantic chunking at the same time as extraction, all for so cheap and with nearly perfect accuracy, and without brittle prompting incantation magic, this is huge.
Could it do exactly the same with a web page? Would this replace something like beautiful soup?
If I used Gemini 2.0 for extraction and chunking to feed into a RAG that I maintain on my local network, then what sort of locally-hosted LLM would I need to gain meaningful insights from my knowledge base? Would a 13B parameter model be sufficient?
Small point but is it doing semantic chunking, or loading the entire pdf into context? I've heard mixed results on semantic chunking.
It's cheap now because Google is subsidizing it, no?
This is great. I just want to highlight how nuts it is that we have spun up whole industries around extracting text that was typically printed from a computer, back into a computer.
There should be laws that mandate that financial information be provided in a sensible format: even Office Open XML would be better than this insanity. Then we can redirect all this wasted effort into digging ditches and filling them back in again.
I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
This is giving me hope that it's possible.
(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.
>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
For this specific use case you can also try edgartools[1], a relatively recently released library that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.
[1] https://github.com/dgunning/edgartools
If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.
isn't everyone on iXBRL now? Or are you struggling with historical filings?
How do today’s LLM’s like Gemini compare with the Document Understanding services google/aws/azure have offered for a few years, particularly when dealing with known forms? I think Google’s is Document AI.
I've found the highest accuracy solution is to OCR with one of the dedicated models then feed that text and the original image into an LLM with a prompt like:
"Correct errors in this OCR transcription".
member of the gemini team here -- personally, i'd recommend directly using gemini vs the document understanding services for OCR & general docs understanding tasks. From our internal evals gemini is now stronger than these solutions and is only going to get much better (higher precision, lower hallucination rates) from here.
GCP's Document AI service is now literally just a UI layer specific to document-parsing use cases, backed by Gemini models. When we realized that, we dumped it and just use Gemini directly.
Your OCR vendor would be smart to replace their own system with Gemini.
They will, and they'll still have a solid product to sell, because their value proposition isn't accurate OCR per se, but putting an SLA on it.
Reaching reliability with LLM OCR might involve some combination of multiple LLMs (and keeping track of how they change), perhaps mixed with old-school algorithms, and random sample reviews by humans. They can tune this pipeline however they need at their leisure to eke out extra accuracy, and then put written guarantees on top, and still be cheaper for you long-term.
With "next generation, extremely sophisticated AI", to be precise, I would say. ;)
Marketing joke aside, maybe a hybrid approach could serve the vendor well: best of both worlds if it reaps benefits. Or even have a look at Hugging Face for more specialized (aka better) LLMs.
I work in financial data and our customers would not accept 96% accuracy in the data points we supply. Maybe 99.96%.
For most use cases in financial services, accurate data is very important.
so, what solution are you using to extract data with 99.96% accuracy?
I'm curious to hear about your experience with this. Which solution were you using before (the one that took 12 minutes)? If it was a self-hosted solution, what hardware were you using? How does Gemini handle PDFs with an unknown schema, and how does it compare to other general PDF parsing tools like Amazon Textract or Azure Document Intelligence? In my initial test, tables and checkboxes weren't well recognized.
> For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair".
I'm actually somewhat surprised Gemini didn't guess from context that LLC is much more likely?
I guess the OCR subsystem is intentionally conservative? (Though I'm sure you could do a second step on your end: take the output from the conservative OCR pass, send it through Gemini, and ask it to flag potential OCR problems. I bet that would flag most of them with very few false positives and false negatives.)
Where I work we've had great success at using LLMs to OCR paper documents that look like
https://static.foxnews.com/foxnews.com/content/uploads/2023/...
but were often written with typewriters long ago to get nice structured tabular output. Deals with text being split across lines and across pages just fine.
How about a comparison with traditional proprietary on-premise software like OmniPage or ABBYY, or those listed here: https://en.wikipedia.org/wiki/Comparison_of_optical_characte...
It is cheaper now, but I wonder if it will continue to be cheaper when companies like Google and OpenAI decide they want to make a profit off of AI, instead of pouring billions of dollars of investment funds into it. By the time that happens, many of the specialized service providers will be out of business and Google will be free to jack up the price.
I use Claude through OpenRouter (with Aider), and was pretty amazed to see that it routes the requests during the same session almost round-robin through Amazon Bedrock, sometimes through Google Vertex, sometimes through Anthropic themselves, all of course using the same underlying model.
Literally whoever has the cheapest compute.
With the speed that AI models are improving these days, it seems like the 'moat' of a better model is only a few months before it is commoditized and goes to the cheapest provider.
What do the PDFs contain?
I’ve been wanting to build a system that ingests PDF reports which reference other types of data like images, CSVs, etc. that can also be ingested, to ultimately build an analytics database from the stack of unsorted data and its metadata, but I have not found any time to do anything like that yet. What kind of tooling do you use to build your data pipelines?
It's great to hear it's this good, and it makes sense, since Google has had several years of experience creating document-type-specific OCR extractors as components of their Document AI product in Cloud. What's most heartening is to hear that the legwork they did for that set of solutions has made it into Gemini for consumers (and businesses).
Successful document processing vendors use LLMs already. I know this at least of Klippa. They have (apparently) fine-tuned models, prompts, etc. The biggest issues with using LLMs directly are error handling, validation and "parameter drift"/randomness. This is the typical "I'll build it myself, but worse" thing.
I'm interested to hear what your experience has been dealing with optional data. For example if the input pdf has fields which are sometimes not populated or nonexistent, is Gemini smart enough to leave those fields blank in the output schema? Usually the LLM tries to please you and makes up values here.
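One pattern that may help is making "absent" an explicit, legal answer in the schema itself, for example via a hypothetical fragment like the one below (whether a given provider's structured-output mode accepts nullable types varies, so check the docs):

```python
# Sketch: make "absent" an explicit, legal answer so the model isn't
# pushed into inventing a value. Hypothetical schema fragment only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "po_number": {
            # Explicitly nullable: the model may return null when the
            # field is not present on the document.
            "type": ["string", "null"],
            "description": "Purchase order number, or null if not present on the document.",
        },
    },
    "required": ["invoice_number", "po_number"],
}
```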
You could ingest them with AWS Textract and have predictability and formatting in the format of your choice. Using LLMs for this is lazy and generates unpredictable and non-deterministic results.
Did you try other vision models such as ChatGPT and Grok? I'm doing something similar but have struggled to find good comparisons between the vision models in terms of OCR and document understanding.
If the documents have the same format, maybe you could include an example document in the prompt, so the boilerplate stuff (like LLC) gets handled properly.
You could probably take this a step further and pipe the OCR'ed text into Claude 3.5 Sonnet and get it to fix any OCR errors
What if you prompt Gemini that mistaking LLC for IIC is a common mistake? Will Gemini auto correct it?
With lower temperature, it seems to work okay for me.
A _killer_ awesome thing it does too is allow code specification in the config instead of through repeated attempts at prompts.
Just to make sure: you are talking about your experiences with Gemini 1.5 Flash here, right?
Hi! Any guesstimate for pages/minute from your Gemini OCR experience? Thanks!
So are you mostly processing PDFs with data? Or PDFs with just text, or images, graphs?
Not the parent, but we process PDFs with text, tables, diagrams. Works well if the schema is properly defined.
Is privacy a concern?
Why would it be? Their only concern is IPO.
In fintech I'd suspect the PDFs are public knowledge
What hardware are you using to run it?
The Gemini model isn't open so it does not matter what hardware you have. You might have confused Gemini with Gemma.
“LLC” to “IIC” is one thing. But wouldn’t that also make it just as easy to to mistake something like “$100” for “$700”?
Out of interest, did you parse into any sort of defined schema/structure?
Parent literally said so …
> Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.
The Gemini API has a customer noncompete, so it's not an option for AI work. What are you working on that doesn't compete with AI?
You do realize most people aren't working on AI, right?
Also, OP mentioned fintech at the outset.
what doesn't compete with ai?
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
You then feed each box of text to a regular OCR model, also gives you a confidence score along with each prediction it makes.
You feed each image box into a multimodal model to describe what the image is about.
For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
You then stitch everything together in an XML file because Markdown is for human consumption.
You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
You then get chunking with location data and confidence scores for every part of the document to put as metadata into the RAG store.
I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
Not sure what service you're basing your calculation on, but with Gemini I've processed 10,000,000+ shipping documents (PDFs and PNGs) of every conceivable layout in one month at under $1000 and an accuracy rate of between 80-82% (humans were at 66%).
The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Just to get sick with it, we actually added some recursion to the Gemini step to have it rate how well it extracted, and if it was below a certain rating, to rewrite its own instructions on how to extract the information and then feed it back into itself. We didn't see any improvement in accuracy, but it was still fun to do.
>Not sure what service you're basing your calculation on but with Gemini
The table of costs in the blog post. At 500,000 pages per day the hardware fixed cost overcomes the software variable cost at day 240 and from then on you're paying an extra ~$100 per day to keep it running in the cloud. The machine also had to use extremely beefy GPUs to fit all the models it needed to. Compute utilization was between 5 to 10% which means that it's future proof for the next 5 years at the rate at which the data source was growing.
There is also the fact that it's _completely_ local. Which meant we could throw in every proprietary data source that couldn't leave the company at it.
>The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Each company should build tools which match the skill level of their developers. If you're not comfortable training models locally with all that entails off the shelf solutions allow companies to punch way above their weight class in their industry.
Very cool! How are you storing it to a database - vectors? What do you do with the extracted data (in terms of being able to pull it up via some query system)?
> [with] an accuracy rate of between 80-82% (humans were at 66%)
Was this human-verified in some way? If not, how did you establish the facts-on-the-ground about accuracy?
I feel compelled to reply. You've made a bunch of assumptions, and presented your success (likely with a limited set of table formats) as the one true way to parse PDFs. There's no such thing.
In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted sub sections, tables inside cells etc. Claude (and now Gemini) can parse complex tables and convert that to meaningful data. Your solution will likely fail, because rules are fuzzy in the same way written language is fuzzy.
Recently someone posted this on HN, it's a good read: https://lukaspetersson.com/blog/2025/bitter-vertical/
> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
No, not like that, but often as nested Json or Xml. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.
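For example, one cheap check for financial documents is arithmetic consistency; a minimal sketch:

```python
# Sketch: flag extractions whose line items don't add up to the stated
# total -- a cheap way to find documents that likely need human review.
def needs_review(extraction, tolerance=0.01):
    line_total = sum(item["amount"] for item in extraction["line_items"])
    return abs(line_total - extraction["stated_total"]) > tolerance

doc = {
    "line_items": [{"amount": 120.00}, {"amount": 79.95}],
    "stated_total": 199.95,
}
print(needs_review(doc))  # False -- the sums match, so no flag
```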
> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
One should refrain making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1000, one might prefer Gemini/other over spending engineering time. There are many apps where processing a single doc is say $10 in revenue. You don't care about OCR costs.
> I've built a system that reads 500k pages _per day_ using the above completely locally on a machine that cost $20k.
The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.
Related discussion:
AI founders will learn the bitter lesson
https://news.ycombinator.com/item?id=42672790 - 25 days ago, 263 comments
The HN discussion contains a lot of interesting ideas, thanks for the pointer!
You're making an even less charitable set of assumptions:
1). I'm incompetent enough to ignore publicly available table benchmarks.
2). I'm incompetent enough to never look at poor quality data.
3). I'm incompetent enough to not create a validation dataset for all models that were available.
Needless to say you're wrong on all three.
My day rate is $400 + taxes per hour if you want to be run through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.
Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.
I integrated gemini recently to improve accuracy in certain blocks like tables. (get initial text, then pass to gemini to refine) Marker alone works about as well as gemini alone, but together they benchmark much better.
I used sxml [0] unironically in this project extensively.
The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.
[0] https://en.wikipedia.org/wiki/SXML
Why process separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?
It's funny you astroturf your own project in a thread where another is presenting tangential info about their own
what does marker add on top of docling?
This is a great comment. I will mention another benefit to this approach: the same pipeline works for PDFs that are digital-native and don't require OCR. After the object detection step, you collect the text directly from within the bounding boxes, and the text is error-free. Using Gemini means that you give this up.
You're describing yesterday's world. With the advancement of AI, there is no need for any of these many steps and stages of OCR anymore. There is no need for XML in your pipeline, because Markdown is now equally suited for machine consumption by AI models.
The results we got 18 months ago are still better than the current gemini benchmarks at a fraction the cost.
As for Markdown, great. Now how do you encode the metadata about the model's confidence that the text says what it thinks it says? Because XML has this lovely thing called attributes, which lets you keep a provenance record, readable by the LLM, without a second database.
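For what it's worth, that kind of attributed markup is easy to emit with the standard library; a toy example, where the element names, boxes and scores are made up:

```python
# Toy example: keep provenance (bounding box, confidence) as XML
# attributes alongside the extracted text. All values are illustrative.
import xml.etree.ElementTree as ET

doc = ET.Element("document", source="scan_0042.png")
para = ET.SubElement(doc, "paragraph", bbox="34,120,580,210", ocr_confidence="0.97")
para.text = "Net revenue increased 12% year over year."
cell = ET.SubElement(doc, "table_cell", bbox="34,260,180,290", ocr_confidence="0.71")
cell.text = "1,204.5"  # low confidence: a reviewer (or LLM) can be pointed here

print(ET.tostring(doc, encoding="unicode"))
```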
Just commenting here so that I can find my way back to this comment later. You perfectly captured the AI hype in one small paragraph.
> I've built a system that reads 500k pages _per day_ using the above completely locally on a machine that cost $20k.
That is impressive. However, if someone needs to read a couple of hundred pages per day, there's no point in setting all that up.
Also, you neglected to mention the cost of setting everything up. The machine cost $20k; but your time, and cost to train yolo8, probably cost more than that. If you want to compare costs (find a point where local implementation such as this is better ROI), you should compare fully loaded costs.
Or, depending on your use case, you do it in one step and ask an LLM to extract data from a PDF.
What you describe is obviously better and more robust by a lot, but the LLM-only approach is not "wrong". It's simple, fast, easy to set up and understand, and it works. With less accuracy, but it does work. Depending on the constraints, development budget and load, it's a perfectly acceptable solution.
We did this to handle 2000 documents per month and are satisfied with the results. If we need to upgrade to something better in the future we will, but in the mean time, it’s done.
FWIW, I'm not convinced Gemini isn't using a document-based object detection model for this, at least for some parts or for some doc categories (especially common things like IDs, bills, tax forms, invoices & POs, shipping documents, etc. that they've previously created document extractors for as part of their DocAI cloud service).
I don't see why they would do that. The whole point of training a model like Gemini is that you train the model - if they want it to work great against those different categories of document the likely way to do it is to add a whole bunch of those documents to Gemini's regular training set.
Getting "bitter lesson" vibes from this post
The bitter lesson is very little of the sort.
If we had unlimited memory, compute and data we'd use a rank N tensor for an input of length N and call it a day.
Unfortunately N^N grows rather fast and we have to do all sorts of interesting engineering to make ML calculations complete before the heat death of the universe.
Only thing I could find about GridFormer and tables was this: https://arxiv.org/pdf/2309.14962v1
But there is no GitHub link or details on the implementation. Only model available seems to be one for removing weather effects from images: https://github.com/TaoWangzj/GridFormer
Would you care to expand on how you would use GridFormer for extracting tables from images? It seems like it's not as trivial as using something like Excalibur or Tabula, both of which seem more battle-tested.
That sounds like a sound approach. Are the steps easily upgradable with better models? Also, it sounds like you can use a character recognition model on single characters? Do you do extra checks for numerical characters?
This is exactly the wrong mentality to have about new technology.
Impressive. Can you share anything more about this project? 500k pages a day is massive and I can imagine why one would require that much throughput.
It was a financial company that needed a tool that would outperform the Bloomberg terminal for traders and quants in markets where its coverage is spotty.
You mentioned GridFormer; I found a paper describing it (GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction). How did you implement it?
Do you know of another model besides GridFormer for detecting tables that has an available implementation somewhere?
We had to roll our own from research papers unfortunately.
The number one takeaway we got was to use much larger images than anyone else ever mentioned to get good results. A rule of thumb was that if you print the PNG of the image, it should be easily readable from 2m away.
The actual model is proprietary and stuck in corporate land forever.
I honestly can't tell if you are being serious. Is there any doubt that the "OCR pipeline" will just be an LLM and it's just a matter of time?
What you are describing is similar to how computers used to detect cats. You first extract edges, textures and gradients. Then you use a sliding window and run a classifier. Then you use NMS to merge the bounding boxes.
What object detection model do you use?
Is tesseract even ML based? Oh, this piece of software is more than 19 years old, perhaps there are other ways to do good, cheap OCR now. Does Gemini have an OCR library, internally? For other LLMs, I had the feeling that the LLM scripts a few lines of python to do the actual heavy lifting with a common OCR framework.
Custom trained yolo v8. I've moved on since then and the work was done in 2023. You'd get better results for much less today.
Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.
What would it have taken to store the plain text in some meta field in the document. Argh, so annoying.
PDF provides that capability, but editors don't produce it, probably because printing goes through OS drivers that don't support it, or PDF generators that don't support it. Or they do support it, but users don't know to check that option, or turn it off because it makes PDFs too large.
PDF supports that just fine. It's just that many PDF publishers choose not to use that.
You can lead a horse to water...
PDFs began as just PostScript commands stored in a file. It’s a genius hack in a way, but it has become a Frankenstein’s monster.
People kind of dump whatever in pdf files, so I don't think a cleaner file format would do as much as you might think.
Digital fax services will generate pdf files, for example. They're just image data dumped into a pdf. Various scanners will also do so.
is "put this glyph at coordinate (x,y)" really what you'd call "structured"?
He's calling PDFs unstructured: structured editors -> unstructured PDF -> structured data
It's not the structure that allows meaningful understanding.
Something that was clearly a table now becomes just a bunch of glyphs physically close to each other. One group of glyphs forms a box that is visually separated from another group, and only when considered together do they turn out to be parts of the same table.
In my experience AWS Textract does a pretty good job without using LLMs.
... and calls it "portable", to boot.
We are driving full speed into a Xerox 2.0 moment, and this time we are doing so knowingly. At least with Xerox, the errors were out of place and easy to detect by a human. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I wonder if we will adapt our systems and procedures to account for hallucinations and "85%" accuracy.
And no, outlawing the use of AI or increasing liability for its use will do next to nothing to deter its misuse, and everyone knows it. My heart goes out to the remaining 15%.
I love generative AI as a technology. But the worst thing about its arrival has been the reckless abandonment of all engineering discipline and common sense. It’s embarrassing.
CCC talk about Xerox copiers changing numbers when doing OCR:
https://media.ccc.de/v/31c3_-_6558_-_de_-_saal_g_-_201412282...
Would be nice to get a translation for a broader audience, glad folks are reporting this out!
the first thing that guy says is that existing non-AI solutions are not that great. then he says that AI beats them in accuracy. so i don't quite understand the point you're trying to make here
Humans accept a degree of error for convenience (driving is one example). But no, 15% is not an acceptable rate. More like 0.15% to 0.015%, depending on the country.
Meh, just maintain an audit log and an escalation subsystem. No need to be luddites when the problems are process, not tech stack.
(disclaimer I am CEO of llamaindex, which includes LlamaParse)
Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.
Some quick notes: 1. I'm glad that LlamaParse is mentioned in the article, but it's not mentioned in the performance benchmarks. I'm pretty confident that our most accurate modes are at the top of the table benchmark - our stuff is pretty good.
2. There's a long tail of issues beyond just tables - this includes fonts, headers/footers, ability to recognize charts/images/form fields, and as other posters said, the ability to have fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these modes, and we need proper benchmarks for that.
3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits + can retry on failure.
The very first (and probably hand-picked and checked) example on your website [0] suffers from the very problem people are talking about here - in the "Fiscal 2024" row it contains an error in the CEO CAP column. The image says "$234.1" but the parsed result says "$234.4". A small error, but an error nonetheless. I wonder if we can ever fix these kinds of errors with LLM parsing.
[0] https://www.llamaindex.ai/llamaparse
Looks like this was fixed, the parsed result says "$234.1" on my end. I wonder if the error was fixed manually or with another round of LLM parsing?
I'm a happy customer. I wrote a ruby client for your API and have been parsing thousands of different types of PDFs through it with great results. I tested almost everything out there at the time and I couldn't find anything that came close to being as good as llamaparse.
Indeed, this is also my experience. I have tried a lot of things and where quality is more important than quantity, I doubt there are many tools that can come close to Llamaparse.
All your examples are exquisitely clean digital renders of digital documents. How does it fare with real scans (noise, folds) or photos? Receipts?
Or is there a use case for digital non-text pdfs? Are people really generating image and not text-based PDFs? Or is the primary use case extracting structure, rather than text?
Hi Jerry,
How well does llamaparse work on foreign-language documents?
I have pipeline for Arabic-language docs using Azure for OCR and GPT-4o-mini to extract structured information. Would it be worth trying llamaparse to replace part of the pipeline or the whole thing?
yes! we have foreign language support for better OCR on scans. Here's some more details. Docs: https://docs.cloud.llamaindex.ai/llamaparse/features/parsing... Notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...
There's an error right on your landing page [1] with the parsed document...
It's supposed to say 234.1, not 234.4
https://www.llamaindex.ai/llamaparse
But can it do this table?!:
https://x.com/preston_mos/status/1853931388929511619?s=46
I've been using NotebookLM powered by Gemini 2.0 for three projects and it is _really powerful_ for comprehending large corpuses you can't possibly read and thinking informed by all your sources. It has solid Q&A. When you ask a question or get a summary you like [which often happens] you can save it as a new note, putting it into the corpus for analysis. In this way your conclusions snowball. Yes, this experience actually happens and it is beautiful.
I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works - you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big blog pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change I'm sure :)
My projects are set up to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background, then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think - which REALLY helps out.
They are:
* The Stratigrapher - A Lovecraftian short story about the world's first city. All of Seton Lloyd/Faud Safar's work on Eridu. Various sources on Sumerian culture and religion. All of Lovecraft's work and letters. Various sources about opium. Some articles about nonlinear geometries.
* FPGA Accelerated Graph Analytics - An introduction to Verilog. Papers on FPGAs and graph analytics. Papers on Apache Spark architecture. Papers on GraphFrames and a related rant I created about it and graph DBs. A source on Spark-RAPIDS. Papers on subgraph matching, graphlets, network motifs. Papers on random graph models.
* Graph machine learning notebook without a specific goal, which has been less successful. It helps to have a goal for the project. It got confused by how broad my sources were.
I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)
thanks a ton for all the amazing feedback on this thread! if
(a) you have document understanding use cases that you'd like to use gemini for (the more aspirational the better) and/or
(b) there are loss cases for which gemini doesn't work well today,
please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working & improve quality for our next series of model updates!
What if you need to scan pages from thick paper books or bound documents without a specialized book scanner?
I have two use cases in mind:
1. Photographs of open book.
2. Having video feed of open book where someone flips pages manually.
We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.
https://tika.apache.org/
Under the hood Tika uses Tesseract for OCR parsing. For clarity, this all works surprisingly well generally speaking, and it's pretty easy to run yourself and an order of magnitude cheaper than most services out there.
https://tesseract-ocr.github.io/tessdoc/
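For anyone who wants to try this locally, a minimal sketch using the tika-python wrapper (it downloads and runs the Tika server JAR in the background, so Java is required; the file name is just a placeholder):

    # Extract text and metadata from a PDF with Apache Tika via tika-python.
    from tika import parser

    parsed = parser.from_file("statement.pdf")        # placeholder input file
    print(parsed["metadata"].get("Content-Type"))     # basic metadata
    print(parsed["content"][:500])                    # first 500 characters of extracted text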
In my mind, Gemini 2.0 changes everything because of the incredibly long context (2M tokens on some models), while having strong reasoning capabilities.
We are working on compliance solution (https://fx-lex.com) and RAG just doesn’t cut it for our use case. Legislation cannot be chunked if you want the model to reason well about it.
It’s magical to be able to just throw everything into the model. And the best thing is that we automatically benefit from future model improvements along all performance axes.
What does "throw everything into the model" entail in your context?
How much data are you able to feed into the model in a single prompt and on what hardware, if I may ask?
Gemini models run in the cloud, so there is no issue with hardware.
The EU regulations typically include delegated acts, technical standards, implementation standards and guidelines. With Gemini 2.0 we are able to just throw all of this into the model and have it figure out.
This approach gives way better results than anything we are able to achieve with RAG.
My personal bet is that this is what the future will look like. RAG will remain relevant, but only for extremely large document corpuses.
Maybe a dumb question, have you tried fine tuning on the corpus, and then adding a reasoning process (like all those R1 distillations)?
We haven't tried that, we might do that in the future.
My intuition - not based on any research - is that recall should be a lot better from in context data vs. weights in the model. For our use case, precise recall is paramount.
Somewhat tangential, but the EU has a directive mandating electronic invoicing for public procurement.
One of the standards that has come out of that is EN 16931, also known as ZUGFeRD and Factur-X, which basically involves embedding an XML file with the invoice details inside a PDF/A. It allows the PDF to be used like a regular PDF but it also allows the government procurement platforms to reliably parse the contents without any kind of intelligence.
It seems like a nice solution that would solve a lot of issues with ingesting PDFs for accounting, if everyone somehow managed to agree on a standard. Maybe if EN 16931 becomes more broadly available it might start getting used in the private sector too.
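For illustration, a minimal sketch of reading the embedded invoice XML out of a Factur-X hybrid PDF. It assumes pypdf's attachments accessor and the conventional attachment name, both of which you should verify against your files:

    # Pull the embedded EN 16931 invoice XML out of a Factur-X / ZUGFeRD PDF.
    from pypdf import PdfReader
    import xml.etree.ElementTree as ET

    reader = PdfReader("invoice.pdf")
    # Attachment name varies by profile; older ZUGFeRD files may use "zugferd-invoice.xml".
    xml_bytes = reader.attachments["factur-x.xml"][0]
    root = ET.fromstring(xml_bytes)
    print(root.tag)  # root element of the structured invoice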
> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes
Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.
[1] https://qwenlm.github.io/blog/qwen2.5-vl/
>Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes
This is what I have found as well. From what I've read, LLMS do not work well with images for specific details due to image encoders which are too lossy. (No idea if this is actually correct.) For now I guess you can use regular OCR to get bounding boxes.
Modern multimodal encoders for LLMs are fine/not lossy, since they do not resize to a small fixed size and can handle arbitrary sizes, although some sizes are obviously better represented in the training set. An 8.5" x 11" page would be common.
I suspect the issue is prompt engineering related.
> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
> - Values should be percentages of the image width and height (0 to 1)
LLMs have enough trouble with integers (since token-wise integers and text representation of integers are the same), high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers" then parse and calculate the decimals later.
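As a sketch of that post-processing step (the page size and box values here are made-up numbers):

    # Ask the model for integer pixel boxes against a stated page size,
    # then normalize to 0-1 fractions yourself.
    PAGE_W, PAGE_H = 850, 1100  # the dimensions you told the model

    def normalize(box):
        x0, y0, x1, y1 = box    # integer pixel coordinates, top-left origin
        return (x0 / PAGE_W, y0 / PAGE_H, x1 / PAGE_W, y1 / PAGE_H)

    print(normalize((85, 110, 425, 220)))  # -> (0.1, 0.1, 0.5, 0.2)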
Just tried this and it did not appear to work for me. Prompt:
>Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
>this input document is 1080 x 1236 px. return the bounding boxes as integers
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.
I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production is when accuracy requirements are high (> 97%). This is because OCR and parsing is only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.
This requires things like:
- state-of-the-art parsing powered by VLMs and OCR
- multi-step extraction powered by semantic chunking, bounding boxes, and citations
- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)
- tooling that lets nontechnical members quickly iterate, review results, and improve accuracy
- evaluation and benchmarking tools
- fine-tuning pipelines that turn reviewed corrections —> custom models
Very excited to test and benchmark Gemini 2.0 in our product, and very excited about the progress here.
[1] https://extend.app/
> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.
I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.
Seems like you could solve hallucinations by repeating the task multiple times. Non-hallucinations will be the same. Hallucinations will be different. Discard and retry hallucinated sections. This increases cost by a fixed multiple, but if cost of tokens continues to fall that's probably perfectly fine.
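A minimal sketch of that consistency check, assuming a hypothetical `extract(pdf)` call that returns a flat dict of scalar fields:

    from collections import Counter

    def consensus_extract(extract, pdf, runs=3):
        results = [extract(pdf) for _ in range(runs)]
        agreed, disputed = {}, {}
        for field in results[0]:
            values = [r.get(field) for r in results]
            value, count = Counter(values).most_common(1)[0]
            if count == runs:       # unanimous across runs -> unlikely to be a hallucination
                agreed[field] = value
            else:                   # runs disagree -> retry or route to human review
                disputed[field] = values
        return agreed, disputed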
If you see above, someone is using a second and even third LLM to correct LLM outputs, I think it is the way to minimize hallucinations.
I think professional services will continue to use OCRs in one way or another, because it's simply too cheap, fast, and accurate. Perhaps, multi-modal models can help address shortcomings of OCRs, like layout detection and guessing unrecognizable characters.
The numbers in the blog post seem VERY inaccurate.
Quick calculation: Input pricing: Image input in 2.0 Flash is $0.0001935. Let's ignore the prompt. Output pricing: Let's assume 500 tokens per page, which is $0.0003.
Cost per page: $0.0004935
That means 2,026 pages per dollar. Not 6,000!
Might still be cheaper than many solutions but I don't see where these numbers are coming from.
By the way, image input is much more expensive in Gemini 2.0 even for 2.0 Flash Lite.
Edit: The post says batch pricing, which would be 4k pages based on my calculation. Using batch pricing is pretty different though. Great if feasible but not practical in many contexts.
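The same arithmetic in code form, using the prices quoted above and the roughly 50% batch discount this comment seems to assume:

    input_cost_per_page = 0.0001935   # image input, Gemini 2.0 Flash (as quoted above)
    output_cost_per_page = 0.0003     # ~500 output tokens (as quoted above)
    per_page = input_cost_per_page + output_cost_per_page
    print(round(1 / per_page))        # ~2026 pages per dollar at list price
    print(round(1 / (per_page / 2)))  # ~4052 pages per dollar with batch pricing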
Correct, it's with batch Vertex pricing and slightly lower output tokens per page, since a lot of pages are somewhat empty in real-world docs - I wanted a fair comparison to providers that charge per page.
Regardless of what assumptions you use - it's still an order of magnitude + improvement over anything else.
I've not followed the literature very closely for some time - what problem are they trying to solve in the first place? They write "for documents to be effectively used in RAG pipelines, they must be split into smaller, semantically meaningful chunks". Segmenting each page by paragraphs doesn't seem like a particularly hard vision problem, nor do I see why an OCR system would need to incorporate an LLM (which seem more like a demonstration of overfitting than a "language model" in any literal sense, going by ChatGPT). Perhaps I'm just out of the loop.
Finally, I must point out that statements in the vein of "Why [product] 2.0 Changes Everything" are more often than not a load of humbug.
Great article, I couldn't find any details about the prompt... only the snippets of the `CHUNKING_PROMPT` and the `GET_NODE_BOUNDING_BOXES_PROMPT`.
Is there any code example with a full prompt available from the OP, or are there any references (such as similar GitHub repos) for those looking to get started with this topic?
Your insights would be highly appreciated.
I think it is very ironic that we chose to use PDF in many fields to archive data because it is a standard and because we would be able to open our pdf documents in 50 or 100 years time. So here we are just a couple of years later facing the challenge of getting the data out of our stupid PDF documents already!
It's not ironic. PDFs are a container, which can hold scanned documents as well as text. Scanned documents need OCR and to be analyzed for their layout. This is not a failing of the PDF format, but a problem inherent to working with print scans.
I don't claim PDF is a good format. It is inscrutable to me.
PDF is a horrible format. Even if it contains plain text it has no concept of something as simple as paragraphs.
One can wonder how much wonkiness in LLMs comes from errors in extracting language from PDFs.
Adobe is the most harmful software development company in existence.
Related:
Gemini 2.0 is now available to everyone
https://news.ycombinator.com/item?id=42950454
I work in the healthcare domain. We've had great success converting printed lab reports (95%) to JSON format using the 1.5 Flash model. This post is really exciting for me; I will definitely try out the 2.0 models.
The struggle that almost every OCR use case faces is with handwritten documents (doctor prescriptions with bad handwriting). With Gemini 1.5 Flash we've had ~75-80% accuracy (based on random sampling by pharmacists). We're planning to improve this further by fine-tuning Gemini models with medical data.
What could be other alternative services/models for accurate handwriting ocr?
> We've had great success converting printed lab reports (95%) to Json format using 1.5-Flash model
Sounds terrifying. How can you be sure that there were no conversion mistakes?
How on earth is anyone ok with 75% accuracy in prescriptions context?!? Or medical anything
That’s literally insane
I'm guessing that human accuracy may be lower or around that value, given that handwritten notes are generally difficult to read. A better metric for document parsing might be accuracy relative to human performance (how much better the LLM performs compared to a human).
Nobody said they're okay with it, nor did they describe what they use the data for.
Hrm, I've been using a combo of Textract (for bounding boxes) and AI for understanding the contents of the document. Textract is excellent at bounding boxes and exact-text capture, but LLMs are excellent at understanding when a messy/ugly bit of a form is actually one question, or whether there are duplicate questions, etc.
Correlating the two outputs (Textract <-> AI) is difficult, but another round of AI is usually good at that. Combined with some text-difference scoring and logic, I can get pretty good full-document understanding of questions and answer locations. I've spent a pretty absurd amount of time on this and as of yet have not launched a product with it, but if anyone is interested I'd love to chat about the pipeline!
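One hedged sketch of that correlation step: fuzzy-match each LLM answer against short spans of Textract words and keep the boxes of the best-matching span. The `words` list of (text, bbox) pairs is assumed to have been pulled from Textract's JSON already:

    from difflib import SequenceMatcher

    def locate(answer, words, max_span=8):
        """Return (similarity, boxes) for the word span that best matches `answer`."""
        best = (0.0, [])
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_span, len(words)) + 1):
                span_text = " ".join(w for w, _ in words[i:j])
                score = SequenceMatcher(None, answer.lower(), span_text.lower()).ratio()
                if score > best[0]:
                    best = (score, [box for _, box in words[i:j]])
        return best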
Been toying with the Flash model. Not the top model, but I think it'll see plenty of use due to the details. It wins on things other than being at the top of benchmark leaderboards:
* Generous free tier
* Huge context window
* Lite version feels basically instant
However
* Lite model seems more prone to repeating itself / looping
* Very confusing naming, e.g. {model}-latest worked for 1.5 but now it's {model}-001? The Lite version has a date appended, the non-Lite does not. Then there is exp and thinking-exp... which has a date. Wut?
> * Huge context window
But how well does it actually handle that context window? E.g. a lot of models support 200K context, but the LLM can only really work with ~80K or so of it before it starts to get confused.
It works REALLY well. I have used it to dump in many reference code files and then help me write new modules, etc. I have gone up to 200k tokens, I think, with no problems in recall.
I'm sure someone will do a haystack test, but from my casual testing it seems pretty good
There is the needle in the haystack measure which is, as you probably guessed, hiding a small fact in a massive set of tokens and asking it to recall it.
Recent Gemini models actually do extraordinarily well.
https://cloud.google.com/blog/products/ai-machine-learning/t...
It works okay out to roughly 20-40k tokens. Once the window gets larger than that, it degrades significantly. You can needle in the haystack out to that distance, but asking it for multiple things from the document leads to hallucinations for me.
Ironically, GPT-4o works better for me at longer contexts <128k than Gemini 2.0 Flash. And out to 1M it's just hopeless, even though you can do it.
My experience is that Gemini works relatively well on larger contexts. Not perfect, but more reliable.
Ingesting PDFs accurately is a noble goal which will no doubt be solved as LLMs get better. However, I need to point out that the financial statement example used in the article already has a solution: iXBRL.
Many financial regulators require you to publish heavily marked up statements with iXBRL. These markups reveal nuances in the numbers that OCRing a post processed table will not understand.
Of course, financial documents are a narrow subset of the problem.
Maybe the problem is with PDF as a format: Unfortunately PDFs lose that meta information when they are built from source documents.
I can't help but feel that PDFs could probably be more portable as their acronym indicates.
Just to call out -- even better, this library (even in active development) is blowing every other SEC tool I've found out of the water:
https://github.com/dgunning/edgartools
Glad Gemini is getting some attention. Using it is like a superpower. There are so many discussions about ChatGPT, Claude, DeepSeek, Llama, etc. that don't even mention Gemini.
Before 2.0 models their offerings were pretty underwhelming, but now they can certainly hold their own. I think Gemini will ultimately be the LLM that eats the world, Google has the talent and most importantly has their own custom hardware (hence why their prices are dirt cheap and context is huge).
Google had a pretty rough start compared to ChatGPT, Claude. I suspect that left a bad taste in many people's mouths. In particular because evaluating so many LLM's is a lot of effort on its own.
Llama and DeepSeek are no-brainers; the weights are public.
No brainer if you're sitting on a >$100k inference server.
Google was not serious about LLMs; they could not even figure out what to call them. There is always a risk that they will get bored and just kill the whole thing.
I tried using Gemini 2.0 Flash for PDF-to-Markdown parsing of scientific papers after having good results with GPT-4o, but the experience was terrible.
When I sent images of PDF pages with extracted text, Gemini mixed headlines with body text, parsed tables incorrectly, and sometimes split tables, placing one part at the top of the page and the rest at the bottom. It also added random numbers (like inserting an "8" for no reason).
When using the Gemini SDK to process full PDFs, Gemini 1.5 could handle them, but Gemini 2.0 only processed the first page. Worse, both versions completely ignored tables.
Among the Gemini models, 1.5 Pro performed the best, reaching about 80% of GPT-4o’s accuracy with image parsing, but it still introduced numerous small errors.
In conclusion, no Gemini model is reliable for PDF-to-Markdown parsing and beyond the hype - I still need to use GPT-4o.
There is also https://ds4sd.github.io/docling/ from IBM Research, which is MIT-licensed and tracks bounding boxes in a rich JSON format.
Docling has worked well for me. It handles scenarios that crashed ChatGPT Pro. Only problem is it's super annoying to install. When I have a minute I might package it for homebrew.
Did you compare it to tesseract?
If it's superior (esp. for scans with text flowing around image boxes), and if you do end up packaging it up for brew, know that there's at least one developer who will benefit from your work (for a side-project, but that goes without saying).
Thanks in advance!
I have seen no decent program that can read, OCR, analyze, and tabulate data correctly from very large PDF files with a lot of scanned information from different sources. I run my practice with PDF files - one for each patient. It is a treasure trove of actionable data. PDF filing in this manner allows me to finish my daily tasks in 4 hrs instead of 12 hrs! For sick patients who need information at the point of care, PDFs have numerous advantages over the usual hospital EHR portals, etc. If any smart engineers are interested in working with me, please connect with me.
I can help as can many others. Probably a good place to start though is with some of the more recent off the shelf solutions like trellis (I have no affiliation with them).
One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published[1] some great material on spatial understanding and bounding boxes on photography, but identifying sections of text or digital graphics like icons in a presentation is still very hit and miss.
--
[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...
Have you seen any models that perform better at this? I last looked into this a year ago but at the time they were indeed quite bad at it across the board.
What would change "everything" is if we managed to switch to "real" digital parseable formats instead of this dead tree emulation that buries all data before the arrival of AI...
This is what I am trying to figure out how to solve.
My problem statement is:
- Ingest PDFs, summarize, and extract important information.
- Have some way to overlay the extracted information on the pdf in the UI.
- User can provide feedback on the overlaid info by accepting or rejecting the highlights as useful or not.
- This info goes back into the model for reinforcement learning.
Hoping to find something that can make this more manageable.
Most PDF parsers give you coordinate data (bounding boxes) for extracted text. Use these to draw highlights over your PDF viewer - users can then click the highlights to verify if the extraction was correct.
The tricky part is maintaining a mapping between your LLM extractions and these coordinates.
One way to do it would be with two LLM passes. Not the cheapest approach since you're hitting the API twice, but it's straightforward!
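A rough sketch of what those two passes could look like. `call_llm` is a hypothetical wrapper around whatever model API you use, and the prompts are only illustrative:

    import json

    def extract_with_boxes(call_llm, page_text, ocr_words_with_boxes):
        # Pass 1: structured extraction from the raw text.
        fields = json.loads(call_llm(
            "Extract the key fields from this document as JSON:\n" + page_text))
        # Pass 2: map each extracted value back to the OCR words (and boxes) it came from.
        mapping = json.loads(call_llm(
            "Given these OCR words with bounding boxes:\n"
            + json.dumps(ocr_words_with_boxes)
            + "\nFor each field below, return the indices of the words it was read from:\n"
            + json.dumps(fields)))
        return fields, mapping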
Here's a PR thats not accepted yet for some reason that seems to be having some success with the bounding boxes
https://github.com/getomni-ai/zerox/pull/44
Related to
https://github.com/getomni-ai/zerox/issues/7
Have you tried cursor or replit for this?
I’ve been very reluctant to use closed source LLMs. This might actually convince me to use one. I’ve done so many attempts at pdf parsing over the years. It’s awful to deal with. 2 column format omg. Most don’t realize that pdfs contain instructions for displaying the document and the content is buried in there. It’s just always been a problematic format.
So if it works, I’d be a fool not to use it.
Two years ago, I worked for a company that had its own proprietary AI system for processing PDFs. While the system handled document ingestion, its real value was in extracting and analyzing data to provide various insights. However, one key requirement was rendering documents in HTML with as close to a 1:1 likeness as possible.
At the time, I evaluated multiple SDKs for both OCR and non-OCR PDF conversions, but none matched the accuracy of Adobe Acrobat’s built-in solution. In fact, at one point (don’t laugh), the company resorted to running Adobe Acrobat on a Windows machine with automation tools to handle the conversion. Using Adobe’s cloud service for conversion was not an option due to the proprietary nature of the PDFs. Additionally, its results were inconsistent and often worse compared to the desktop version of Adobe Acrobat!
Given that experience, I see this primarily as an HTML/text conversion challenge. If Gemini 2.0 truly improves upon existing solutions, it would be interesting to see a direct comparison against popular proprietary tools in terms of accuracy.
We started with using LLMs for parsing at Tensorlake (https://docs.tensorlake.ai), tried Qwen, Gemini, OpenAI, pretty much everything under the sun. My thought was we could skip 5-6 years of development IDP companies have done on specialized models by going to LLMs.
On information-dense pages, LLMs often hallucinate half of the time; they have trouble understanding empty cells in tables, don't understand checkboxes, etc.
We had to invest heavily into building a state of the art layout understanding model and finally a table structure understanding for reliability. LLMs will get there, but there are some ways to go there.
Where they do well is in VQA-type use cases: ask a very narrowly scoped question and they will work much better than OCR + layout models, because they are much more generalizable and flexible to use.
(Disclosure, CEO of Aryn (https://aryn.ai/) here)
Good post. VLM models are improving and Gemini 2.0 definitely changes the doc prep and ingestion pipeline across the board.
What we're finding as we work with enterprise customers:
1. Attribution is super important, and VLMs aren't there yet. Combining them with layout analysis makes for a winning combo.
2. VLMs are great at prompt-based extraction, but if you have document automation and you don't know where in tables you'll be searching or need to reproduce faithfully -- then precise table extraction is important.
3. VLMs will continue to get better, but the price points are a result of economies of scale that document parsing vendors don't get. On the flip side, document parsing vendors have deployment models that Gemini can't reach.
Shameless plug: I'm working on a startup in this space.
But the bounding box problem hits close to home. We've found Unstructured's API gives pretty accurate box coordinates, and with some tweaks you can make them even better. The tricky part is implementing those tweaks without burning a hole in your wallet.
How is their API priced? I checked a few months ago and remembered it being expensive.
Better have a look at
- https://mathpix.com/
- Docling : https://ds4sd.github.io/docling/
Hmm, I have been doing a bit of this manually lately for a personal project. I am working on some old books that are far past any copyright, but they are not available anywhere on the net (being in Norwegian makes a book a lot more obscure), so I have been working on creating ebooks out of them.
I have a scanner, and some OCR processes I run things through. I am close to 85% from my automatic process.
The pain of going from 85% to 99% though is considerable. (and in my case manual) (well Perl helps)
I went to try this AI on one of the short poem manuscripts I have.
I told the prompt I wanted PDF to Markdown; it says sure, go ahead, give me the PDF. I went to upload it. It spent a long time spinning, then a quick message comes up, something like
"Failed to count tokens"
but it just flashes and goes away.
I guess the PDF is too big? Weird though, its not a lot of pages.
I experienced something similar. My use case is I need to summarize bank statements (sums, averages, etc.). Gemini wouldn't do it, it said too many pages. When I asked the max number of supported pages, it says max is 14 pages. Attempted on both 2.0 flash and 2.0 pro in VertexAI console.
Try with https://aistudio.google.com - I think the page limit is a Vertex thing. The only limit in reality is the number of input tokens taken to parse the PDF. If those tokens + the tokens for the rest of your prompt are under the context window limit, you're good.
Take a screenshot of the pdf page and give that to the LLM and see if it can be processed.
Your PDF might have some quirks inside which the LLM cannot process.
Wonder how this compares to Docling. So far that's been the only tool that really unlocked PDFs for me. It's solid but really annoying to install.
https://ds4sd.github.io/docling/
This is completely tangential, but does anyone know if AI is creating any new jobs?
Thinking of the OCR vendors who get replaced. Where might they go?
One thing I can think of is that AI could help the space industry take off. But wondering if there are any concrete examples of new jobs being created.
> Thinking of the OCR vendors who get replaced. Where might they go?
We are solving more complicated document types, in more languages, longer in size. The scope of work expanded a lot.
I've built a simple OCR tool with Gemini 2 Flash with several options:
1. Simple OCR: extracts all detected text from uploaded files
2. Advanced OCR: enables rule-based extraction (e.g., table data)
3. Bulk OCR: designed for processing multiple files at once
The project will be open-source next week. You can try the tool here: https://gemini2flashocr.netlify.app
I think very soon a new model will destroy whatever startups and services are built around document ingestion. As in a model that can take in a pdf page as a image and transcribe it to text with near perfect accuracy.
Extracting plain text isn’t that much of a problem, relatively speaking. It’s interpreting more complex elements like nested lists, tables, side bars, footnotes/endnotes, cross-references, images and diagrams where things get challenging.
OCR is not 100% either. Reading order is also fragile, it might OCR the word but mess up the line structure.
I think the Azure Document Intelligence, Google Document AI and Amazon Textract are among the best if not the best services though and they offer these models.
I have not tested Azure Document Intelligence or Google Document AI, but AWS Textract, LlamaParse, Unstructured, and Omni made it to my shortlist. I have not tested Docling, as I could not install it on my Windows laptop.
They do not test LlamaParse on the accuracy benchmark. In my personal experience LlamaParse was one of the rare tools that always got the right information. Also, the accuracy is only based on tables, and we had issues with irregular text structures as well. It is also worth noting that using an LLM (a non-deterministic tool) to do something deterministic is a bit risky, and you need to write, modify, and maintain a prompt.
Gemini Flash 2.0 is impressive, but it hardly captures all of the information in the PDF. It's great for getting vibes from the document or finding overall information in it. If you ask it to e.g. enumerate every line item from multiple tables in a long PDF, it still falls flat (dropping some line items or entire sections, etc.). DocuPanda and, to a lesser extent, Unstructured handle this.
I wish more PDFs were generated as hybrid PDFs. These are PDFs that also include their original source material. Then you have a document whose format is fixed, but if you need more semantic information, there it is!
LibreOffice makes this especially easy to do: https://wiki.documentfoundation.org/Faq/Writer/PDF_Hybrid
I wonder how this compares to open source models (which might be less accurate but even cheaper if self-hosted?), e.g. Llama 3.2. I'll see if I can run the benchmark.
Also regarding the failure case in the footnote, I think Gemini actually got that right (or at least outperformed Reducto) - the original document seems to have what I call a "3D" table where the third axis is rows within each cell, and having multiple headers is probably the best approximation in Markdown.
Everything I tried previously had very disappointing results. I was trying to get rid of Azure's DocumentIntelligence, which is kind of expensive at scale. The models could often output a portion of a table, but it was nearly impossible to get them to produce a structured output of a large table on a single page; they'd often insert "...rest of table follows" and similar terminations, regardless of different kinds of prompting.
Maybe incremental processing of chunks of the table would have worked, with subsequent stitching, but if Gemini can just process it that would be pretty good.
I'm failing to understand the ingestion part of Gemini 2.0. Does Gemini provide a dedicated PDF-to-Markdown conversion API, or do the LLM APIs handle it with a prompt like "Extract the attached PDF" using this API: https://ai.google.dev/gemini-api/docs/document-processing?la...
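For what it's worth, the linked docs describe sending the PDF directly as a part of the prompt. A minimal sketch with the google-genai Python SDK, with method names as I understand them from its docs (treat them as an assumption and the file/key as placeholders):

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")          # placeholder key
    pdf_bytes = open("report.pdf", "rb").read()             # placeholder file

    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
            "OCR this PDF into Markdown, preserving tables.",
        ],
    )
    print(response.text)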
Orthogonal to this post, but this just highlights the need for a more machine readable PDF alternative.
I get the inertia of the whole world being on PDF. And perhaps we can just eat the cost and let LLMs suffer the burden going forwards. But why not use that LLM coding brain power to create a better overall format?
I mean, do we really see printing things out onto paper something we need to worry about for the next 100 years? It reminds me of the TTY interface at the heart of Linux. There was a time it all made sense, but can we just deprecate it all now?
PDF does support incorporating information about the logical document structure, aka Tagged PDF. It’s optional, but recommended for accessibility (e.g. PDF/UA). See chapters 14.7–14.8 in [1]. Processing PDF files as rendered images, as suggested elsewhere in this thread, can actually dramatically lose information present in the PDF.
Alternatively, XML document formats and the like do exist. Indeed, HTML was supposed to be a document format. That’s not the problem. The problem is having people and systems actually author documents in that way in an unambiguous fashion, and having a uniform visual presentation for it that would be durable in the long term (decades at least).
PDF as a format persists because it supports virtually every feature under the sun (if authors care to use them), while largely guaranteeing a precisely defined visual presentation, and being one of the most stable formats.
[1] https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
I'm not suggesting we re-invent RDF or any other kind of semantic web idea. And the fact that semantic data can be stored in a PDF isn't really the problem being solved by tools such as these. In many cases, PDF is used for things like scanned documents where adding that kind of metadata can't really be done manually - in fact the kinds of tools suggested in the post would be useful for adding that metadata to the PDF after scanning (for example).
Imagine you went to a government office looking for some document from 1930s, like an ancestors marriage or death certificate. You might want to digitize a facsimile of that using a camera or a scanner. You have a lot of options to store that, JPG, PNG, PDF. You have even more options to store the metadata (XML, RDF, TXT, SQLite, etc.). You could even get fancy and zip up an HTML doc alongside a directory of images/resources that stitched them all together. But there isn't really a good standard format to do that.
It is the second part of your post that stands out - the kitchen-sink nature of PDFs is what makes them so terrible. If they were just wrappers for image data, formatted in a way that made printing them easy, I probably wouldn't dislike them.
Fixed link: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
Strange that LlamaParse is mentioned in the pricing table but not the results. We’ve used them to process a lot of pages and it’s been excellent each time.
I really wish that Google made an endpoint that's compatible with the OpenAI API. That'd make trying Gemini in existing flows so much easier.
I believe this is already the case, at least the Python libraries are compatible, if not recommended for more than just trying things out:
https://ai.google.dev/gemini-api/docs/openai
How well do they work when you want to do things like grounding with search?
Is that not this? https://ai.google.dev/api/compatibility
OCR makes sense, but asking for a summary is another matter. It is not there yet; it gave a lot of incorrect details.
Is there an AI platform where I can paste a snip of a graph and it will generate a n th order polynomial regression for me of the trace?
Either ChatGPT o4 or one of the newer Google models should handle that, since it's a pretty common task. Actually there have been online curve fitters for several years that work pretty well without AI, such as https://curve.fit/ and https://www.standardsapplied.com/nonlinear-curve-fitting-cal... .
I'd probably try those first, since otherwise you're depending on the language model to do the right thing automagically.
I've had decent luck using some of the reasoning models for this. It helps if you task them with identifying where the points on the graph are first.
RE: the loss of bounding box information
You can recover word-level bounding boxes and confidence scores by using a traditional OCR engine such as AWS Textract and matching the results to Gemini’s output – see https://docless.app for a demo (disclaimer: I am the founder)
If it is vendor work, you should probably hire people who are competitive in the software engineering space. And do we actually need a significant amount of processing as a solution? If so, common public PDFs converted to Markdown should be open-sourced. We shouldn't repeat others' work.
Despite that, cheaper is better.
If the end goal is just RAG or search over the PDFs, it seems like ColPali-based embedding search would be a good alternative here. Don't process the PDFs; instead, just search their image embeddings directly. From what I understand, you also get a sort of attention map showing what part of the image is being activated by the search.
Has anyone in the AEC industry who's reading this worked out a good way to get Bluebeam MEP/electrical layouts into Revit (LOD 200-300)?
Have seen MarkupX as a paid option, but it seems some AI in the loop can greatly speed up exception handling, encode family placement to certain elevations based on building code docs....
Curious to see how well this works on technical/mechanical documentation (manuals, parts lists, etc.). Has anyone tried? My company Airwave had to jump through all sorts of hoops to get accurate information for our use case: getting accurate info to the technicians in the field.
Ritvik here from Pulse. Everyone's pretty much made the right points here, but I wanted to emphasize that due to the LLM architecture, these models predict "the most probable text string" corresponding to the embedding, not necessarily the exact text. This non-determinism is awful for customers deploying in production, and a lot of our customers complained about it to us initially. The best approach is to build a sort-of "agent"-based combination of VLMs and traditional layout segmentation/reading-order algorithms, which is what we've done and are continuing to do.
we have a technical blog on this exact phenomena coming out in the next couple days, will attach it here when it’s out!
check us out at https://www.runpulse.com
I'm building a system that does regular OCR and outputs layout-following ASCII; in my admittedly limited tests it works better than most existing offerings.
It will be ready for beta testing this week or the next, and I will be looking for beta testers; if interested please contact me!
I think this is one of the few functional applications of LLMs that is really undeniably useful.
OCR has always been “untrustworthy” (as in you cannot expect it to be 100% correct and know you must account for that) and we have long used ML algorithms for the process.
OCR is not to blame; when you have garbage in, you should not expect anything of high quality out, especially with handwriting, tables, and different languages. Even human beings fail to understand some documents (see doctors' prescriptions).
If OCR is a solution designed to recognize documents and it does not recognize all documents, then it is an imperfect solution.
That is not to say there is a perfect solution, but it is still the fault of the solution.
The article mentions OCR, but you're sending a PDF, so how is that OCR? Or is this a mistake? What if you send photos of the pages, which would be true OCR - do the performance and price remain the same?
If so this unlocks a massive workflow for us.
Anyone know if there are uses of this with PHI? Most doctors still fax reports to each other and this would help a lot to drop the load on staff when receiving and categorizing/assigning to patients
> Crucially, we’ve seen very few instances where specific numerical values are actually misread.
"Very few" is way too many. This means it cannot be trusted, especially when it comes to financial data.
Gemini is amazing, but I get this copyright error for some documents, and I have a rate limit of just 10 requests per minute. Same issue with Claude, except there the copyright error is called a content warning.
90% accuracy +/- 10%? What could that be useful for, that’s awfully low.
> accuracy is measured with the Needleman-Wunsch algorithm
> Crucially, we’ve seen very few instances where specific numerical values are actually misread. This suggests that most of Gemini’s “errors” are superficial formatting choices rather than substantive inaccuracies. We attach examples of these failure cases below [1].
> Beyond table parsing, Gemini consistently delivers near-perfect accuracy across all other facets of PDF-to-markdown conversion.
That seems fairly useful to me, no? Maybe not for mission critical applications, but for a lot of use cases, this seems to be good enough. I'm excited to try these prompts on my own later.
This is "good enough" for Banks to use when doing due diligence. You'd be surprised how much noise is in the system with the current state of the art: algorithms/web scrapers and entire buildings of humans in places like India.
It's certainly pretty useful for discovery/information filtering purposes. I.e. searching for signal in the noise if you have a large dataset.
due diligence of this sort?
https://en.wikipedia.org/wiki/Know_your_customer
would encourage you to take a look at some of the real data here! https://huggingface.co/spaces/reducto/rd_table_bench
you'll find that most of the errors here are structural issues with the table or inability to parse some special characters. tables can get crazy!
Author here — measuring accuracy in table parsing is surprisingly challenging. Subtle, almost imperceptible differences in how a table is parsed may not affect the reader's understanding but can significantly impact benchmark performance. For all practical purposes, I'd say it's near perfect (also keep in mind the benchmark is on very challenging tables).
I guess 90% is for "benchmark", which is typically tailored to be challenging to parse.
Having seen some of these tables, I would guess that's probably above a layperson's score. Some are very complicated or just misleadingly structured.
Switching from manual data entry to approval
There’s AWS Bedrock Knowledge Base (Amazon proprietary RAG solution) which can digest PDFs and, as far as I tested it on real world documents, it works pretty well and is cost effective.
How does the Gemini OCR perform against non-English language text?
I've been working on something similar the past couple months. A few thoughts:
- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism for the best accuracy (see the sketch after this list).
- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Outputs are also much worse when you use too much context.
- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
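A toy sketch of the sliding window mentioned in the first point: overlapping two-page windows so chunks that straddle a page break are always seen whole, with duplicates from the overlap de-duplicated afterwards:

    def sliding_windows(pages, size=2, overlap=1):
        # Yield overlapping windows of pages, e.g. [p0, p1], [p1, p2], [p2, p3], ...
        step = size - overlap
        for start in range(0, max(len(pages) - overlap, 1), step):
            yield pages[start:start + size]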
How is it for image recognition/classification? OCR can be a huge chunk of the image classification pipeline. Presumably, it works just as well in this domain?
Damn, I thought this was about the Gemini protocol.
https://geminiprotocol.net/
Why is Gemini Flash so much cheaper than other models here?
Probably a mix of economies of scale (Google Workspace and Search are already massive customers of these models, meaning the build-out is already there) and some efficiency dividends from hardware R&D (Google has developed the model and the TPU hardware purpose-built to run it almost in parallel).
I've built a simple OCR tool with Gemini 2 Flash; you can test it here: gemini2flashocr.netlify.app
We've previously tried Sonnet in our PDF extraction pipelines. It was very, very accurate; GPT-4o did not come close. It's more expensive, however.
Will 2.0.1 also change everything?
How about 2.0.2?
How about Llama 13.4.0.1?
This is tiring. It's always the end of the world when they release a new version of some LLM.
prompt and pray, this is my default mode while working with LLMs
This is super interesting.
Would this be suitable for ingesting and parsing wildly variable unstructured data into a structured schema?
Why are traditional OCRs better in terms of hallucination and confidence scores?
Can we use logprobs of LLM as confidence scores?
Traditional OCRs are trained for a single task: recognize characters. They do this through visual features (and sometimes there's an implicit (or even explicit) "language" model: see https://arxiv.org/abs/1805.09441). As such, the extent of their "hallucination", or errors, is when there's ambiguity in characters, e.g. 0 vs O (that's where the implicit language model comes in). Because they're trained with a singular purpose, you would expect their confidence scores (i.e. logprobs) to be well calibrated. Also, depending on the OCR model, you usually do a text detection (get bounding boxes) followed by a text recognition (read the characters), and so it's fairly local (you're only dealing with a small crop).
On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also on a dozen other tasks. As such, they're really good OCR models, but they tend to be not as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find it hallucinates that often, but we're not dealing with long documents. I would assume that as you're dealing with a larger set of documents, you have a much larger context, which increases the chances of the model getting confused and hallucinating.
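If the API does expose per-token logprobs, one crude (and, as noted above, not well calibrated) field-level score is the geometric mean of the token probabilities:

    import math

    def field_confidence(token_logprobs):
        # Geometric mean of token probabilities = exp of the average logprob.
        return math.exp(sum(token_logprobs) / len(token_logprobs))

    print(field_confidence([-0.05, -0.2, -0.01]))  # ~0.917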
Would you recommend using these large models for parsing sensitive data - probably say bank statements etc?
I wish I could do this locally. I don't feel comfortable uploading all of my private documents to Google.
Does anyone have some fleshed out source code, prompts and all, to try this on Gemini 2.0?
Also really interested in this
Okay I just checked/tried this out with my own use case at work and it's insane.
We have been building smaller and more efficient VLMs for document extraction since well before this, and we are 10x faster than Unstructured and Reducto (the OCR vendors) with an accuracy of 90%.
P.S. - You can find us here (unsiloed-ai.com) or you can reach out to me on adnan.abbas@unsiloed-ai.com
In what contexts is 0.84 ± 0.16 actually "nearly perfect"?
I think they meant relative to the best other approach, which is Reducto’s given that they are the creators of the benchmark:
Reducto's own model currently outperforms Gemini Flash 2.0 on this benchmark (0.90 vs 0.84). However, as we review the lower-performing examples, most discrepancies turn out to be minor structural variations that would not materially affect an LLM’s understanding of the table.
Is this something we can run locally? if so what's the license?
Gemini are Google cloud/service models. Gemma are the Google local models.
Ok got it, thanks. Is it a direct mapping?
Well, probably not literally "everything".
He found the one thing that Gemini does better.
For data extraction from long documents (100k+ tokens) how does structured outputs via providing a json schema compare vs asking one question per field (in natural language)?
Also I've been hearing good things regarding document retrieval about Gemini 1.5 Pro, 2.0 Flash and gemini-exp-1206 (the new 2.0 Pro?), which is the best Gemini model for data extraction from 100k tokens?
How do they compare against Claude Sonnet 3.5 or the OpenAI models, has anyone done any real world tests?
Imagine there's no PostScript
It's easy if you try
No pdfs below us
Above us only SQL
Imagine all the people livin' for CSV
Hi all - CEO of chunkr.ai here.
The write-up and ensuing conversation are really exciting. I think out of everything mentioned here - the clear stand-out point is that document layout analysis (DLA) is the crux of the issue for building practical doc ingestion for RAG.
(Note: DLA is the process of identifying and bounding specific segments of a document - like section headers, tables, formulas, footnotes, captions, etc.)
Strap in - this is going to be a longy.
We see a lot of people and products basically sending complete pages to LVLMs for converting to a machine-readable format, and for chunking. We tried this + it’s a possible configuration on chunkr as well. It has never worked for our customers, or during extensive internal testing across documents from a variety of verticals. Here are SOME of the common problems:
- Most documents are dense. The model will not OCR everything and miss crucial parts.
- A bunch of hallucinated content thats tough to catch.
- Occasionally it will just refuse to give you anything. We’ve tried a bunch of different prompting techniques and the models return “<image>” or “||..|..” for an ENTIRE PAGE of content.
Despite this - it’s obvious that these ginormous neural nets are great for complex conversions like tables and formulas to HTML/Markdown and LaTeX. They also work great for describing images and converting charts to tables. But that’s the thing - they can only do this if you can pull out these document features individually as cropped images and have the model focus on small snippets of the document rather than the full page.
If you want knobs for speed, quality, and cost, the best approach is to work at a segment level rather than a page level. This is where DLA really shines - the downstream processing options are vast and can be fit to specific needs. You can choose what to process with simple + fast OCR (text-only segments like headers, paragraphs, captions), and what to send to a large model like Gemini (complex segments like tables, formulas, and images) - all while getting juicy bounding boxes for mapping citations. Combine this with solid reading order algos - and you get amazing layout-aware chunking that takes ~10ms.
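To make the segment-level routing concrete, a toy sketch (the segment dicts and the two handler callables are hypothetical stand-ins for whatever DLA/OCR/VLM stack is in use):

    SIMPLE = {"header", "paragraph", "caption", "footnote"}
    COMPLEX = {"table", "formula", "image", "chart"}

    def process_segments(segments, fast_ocr, vlm_convert):
        chunks = []
        for seg in segments:                 # each seg: {"type": ..., "bbox": ..., "crop": ...}
            if seg["type"] in SIMPLE:
                chunks.append(fast_ocr(seg["crop"]))       # cheap, fast text OCR
            elif seg["type"] in COMPLEX:
                chunks.append(vlm_convert(seg["crop"]))    # e.g. table -> HTML, formula -> LaTeX
        return chunks   # bounding boxes stay attached via `segments` for citations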
We made RAG apps ourselves and attempted to index all ~600 million pages of open-access research papers for https://lumina.sh. This is why we built Chunkr - and it needed to be Open Source. You can self-host our solution and process 4 pages per second, scaling up to 11 million pages per month on a single RTX 4090, renting this hardware on Runpod costs just $249/month ($0.34/hour).
A VLM to do DLA sounds awesome. We've played around with this idea but found that VLMs don't come close to models where the architecture is solely geared toward these specific object detection tasks. While it would simplify the pipeline, VLMs are significantly slower and more resource-hungry - they can't match the speed we achieve on consumer hardware with dedicated models. Nevertheless, the numerous advances in the field are very exciting - big if true!
A note on costs:
There are some discrepancies between the API pricing of providers listed in this thread. Assuming 100000 pages + feature parity:
Chunkr API - 200 pages for $1, not 100 pages
AWS Textract - 40 pages for $1, not 1000 pages (No VLMs)
Llama Parse - 13 pages for $1, not 300
A note on RD-Bench:
We’ve been using Gemini 1.5 Pro for tables and other complex segments for a while, so the RD-bench is very outdated. We ran it again on a few hundred samples and got a 0.81 (also includes some notes on the bench itself). To the OP: it would be awesome if you could update your blog post!
https://github.com/lumina-ai-inc/chunkr-table-rdbench/tree/m...
Hi
Remember all the hyperbole a year ago about how Google was failing and finished?
Anyone who cries “<service> is dead” after some new technology is introduced is someone you can safely ignore. For ever. They’re hyperbolic clout chasers who will only ever be right by mistake.
As if, when ChatGPT was introduced, Google would just stay still, cross their arms, and say “well, this is based on our research paper but there’s nothing we can do, going to just roll over and wait for billions of dollars to run out, we’re truly doomed”. So unbelievably stupid.
> Why Gemini 2.0 Changes Everything
Clickbait. It doesn't change "everything". It makes ingestion for RAG much less expensive (and therefore feasible in a lot more scenarios), at the expense of ~7% reduction in accuracy. Accuracy is already rather poor even before this, however, with the top alternative clocking in at 0.9. Gemini 2.0 is 0.84, although the author seems to suggest that the failure modes are mostly around formatting rather than e.g. mis-recognition or hallucinations.
TL;DR: is this exciting? If you do RAG, yes. Does it "change everything" nope. There's still a very long way to go. Protip for model designers: accuracy is always in greater demand than performance. A slow model that solves the problem is invariably better than a fast one that fucks everything up.
In this use-case, accuracy is non-negotiable with zero room for any hallucination.
Overall it changes nothing.
And people always have a hard time understanding what a certain degree of accuracy actually means. E.g. when you hear that a speech recognition system has 95% accuracy (5% WER), it means that it gets every 19th word wrong. That's abysmally bad by human standards - errors in every other sentence. That does not mean it's useless, but you do need to understand very clearly what you're dealing with, and what those errors might do to the rest of your system.
Cool
Now, I could look at this relatively popular post about Google and revise my opinion of HN as an echo chamber, but I’m afraid it’s just that the downvote loving HNers weren’t able to make the cognitive leap from Gemini to Google.
CTO of Reducto here. Love this writeup!
We’ve generally found that Gemini 2.0 is a great model and have tested this (and nearly every VLM) very extensively.
A big part of our research focus is incorporating the best of what new VLMs offer without losing the benefits and reliability of traditional CV models. A simple example of this is we’ve found bounding box based attribution to be a non-negotiable for many of our current customers. Citing the specific region in a document where an answer came from becomes (in our opinion) even MORE important when using large vision models in the loop, as there is a continued risk of hallucination.
Whether that matters in your product is ultimately use case dependent, but the more important challenge for us has been reliability in outputs. RD-TableBench currently uses a single table image on a page, but when testing with real world dense pages we find that VLMs deviate more. Sometimes that involves minor edits (summarizing a sentence but preserving meaning), but sometimes it’s a more serious case such as hallucinating large sets of content.
The more extreme case is that internally we fine tuned a version of Gemini 1.5 along with base Gemini 2.0, specifically for checkbox extraction. We found that even with a broad distribution of checkbox data we couldn’t prevent frequent checkbox hallucination on both the flash (+17% error rate) and pro model (+8% error rate). Our customers in industries like healthcare expect us to get it right, out of the box, deterministically, and our team’s directive is to get as close as we can to that ideal state.
We think that the ideal state involves a combination of the two. The flexibility that VLMs provide, for example with cases like handwriting, is what I think will make it possible to go from 80 or 90 percent accuracy to some number very close 99%. I should note that the Reducto performance for table extraction is with our pre-VLM table parsing pipeline, and we’ll have more to share in terms of updates there soon. For now, our focus is entirely on the performance frontier (though we do scale costs down with volume). In the longer term as inference becomes more efficient we want to move the needle on cost as well.
Overall though, I’m very excited about the progress here.
--- One small comment on your footnote: the evaluation script with the Needleman-Wunsch algorithm doesn’t actually consider the headers output by the models and looks only at the table structure itself.
> deterministically
How are you planning to do this?
Google's models have historically been total disappointments compared to ChatGPT-4. Worse quality, and they won't answer medical questions either.
I suppose I'll try it again, for the 4th or 5th time.
This time I'm not excited. I'm expecting it to be a letdown.
Following this post
You know what'd be fucking nice? The ability to turn Gemini off.
hAIters gonna hAIte