Comment by lazypenguin
17 days ago
I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. One shouldn't underestimate how much ease of use a multi-modal model with a large context window buys you. Ironically, this vendor is the best-known and most successful vendor for OCR'ing this specific type of PDF, yet many of our requests still failed over to their human-in-the-loop process. Despite OCR not being Gemini's specialization, switching was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6 seconds on average, accuracy was about 96% of the vendor's, and the price was significantly cheaper. A lot of the 4% inaccuracies are things like handwritten "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require any fancy "prompt engineering" to contort out a result.
The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. The surprisingly large context window lets you focus on the main problem. It's multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
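Not the parent's actual setup, but a minimal sketch of what that kind of call looks like with the google-generativeai Python SDK (model name, schema fields, and file path are placeholders; check the current SDK docs for the exact API surface):

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Hypothetical schema -- the real one would mirror the fields needed from the PDF.
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "entity_type": {"type": "string"},
        "filing_date": {"type": "string"},
    },
    "required": ["company_name"],
}

model = genai.GenerativeModel("gemini-2.0-flash")
pdf = genai.upload_file("statement.pdf")  # the file "part" added to the prompt

response = model.generate_content(
    [pdf, "OCR this PDF into this format as specified by this json schema:\n"
          + json.dumps(schema)],
    generation_config=genai.GenerationConfig(
        temperature=0.0,                        # reduce run-to-run variation
        response_mime_type="application/json",  # ask for raw JSON back
    ),
)
print(json.loads(response.text))
```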
This is spot on: any legacy vendor focused on a specific type of PDF is going to get obliterated by LLMs. The problem with using an off-the-shelf provider like this is that you get stuck with their data schema. With an LLM, you have full control over the schema, meaning you can parse and extract much more unique data.
The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"
You could improve your accuracy further by adding some chain-of-thought to your prompt, btw. E.g. give each field in your JSON schema a `reasoning` field that comes right before it, so the model can work out (CoT-style) how it got to its answer. If you want to take it to the next level, `citations` in our experience also improve performance (and, when combined with bounding boxes, are powerful for human-in-the-loop tooling).
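A sketch of what that schema shape can look like, with purely illustrative field names:

```python
# Each extracted value gets `reasoning` (and optionally `citation`) siblings that the
# model must fill in before the value itself, forcing per-field chain-of-thought.
invoice_field_schema = {
    "type": "object",
    "properties": {
        "total_amount_reasoning": {
            "type": "string",
            "description": "Step-by-step explanation of where the total was found",
        },
        "total_amount_citation": {
            "type": "string",
            "description": "Verbatim snippet from the document supporting the value",
        },
        "total_amount": {"type": "number"},
    },
    "required": ["total_amount_reasoning", "total_amount"],
}
```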
Disclaimer: I started an LLM doc processing infra company (https://extend.app/)
> The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"
A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.
That's what we did with our web scraping SaaS: with Extraction API¹ we shifted web-scraped data parsing to support both predefined models for common objects like products, reviews, etc., and direct LLM prompts that we further optimize for flexible extraction.
There's definitely space here to help the customer realize their extraction vision because it's still hard to scale this effectively on your own!
1 - https://scrapfly.io/extraction-api
What's the value for a customer in paying a vendor that is only a wrapper around an LLM when they can leverage LLMs directly? I can imagine such tools being accessible for certain types of users, but for customers like those described here, you're better off replacing any OCR vendor with your own LLM integration.
Software is dead; if it isn't a prompt now, it will be a prompt in 6 months.
Most of what we think of as software today will just be a UI. But UIs are also dead.
>A smart vendor will shift into that space - they'll use that LLM themselves
It's a bit late to start shifting now since it takes time. Ideally they should already have a product on the market.
I have some out-of-print books that I want to convert into nice PDFs/EPUBs (like, reference-quality).
1) I don't mind destroying the binding to get the best quality. Any idea how I do so?
2) I have a multipage double-sided scanner (fujitsu scansnap). would this be sufficient to do the scan portion?
3) Is there anything that determines the font of the book text and reproduces that somehow? and that deals with things like bold and italic and applies that either as markdown output or what have you?
4) how do you de-paginate the raw text to reflow into (say) an epub or pdf format that will paginate based on the output device (page size/layout) specification?
Great, I landed on the reasoning and citations bit through trial and error and the outputs improved for sure.
How did you add bounding boxes, especially with a variety of files?
In my open source tool http://docrouter.ai I run both OCR and LLM/Gemini, using litellm to support multiple LLMs. The user can configure extraction schema & prompts, and use tags to select which prompt/llm combination runs on which uploaded PDF.
LLM extractions are searched for in the OCR output, and if matched, the bounding box is displayed based on the OCR output.
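Not docrouter.ai's actual code, but a simplified sketch of how that matching step can work (the OCR word format and the match threshold are assumptions):

```python
from difflib import SequenceMatcher

def find_bounding_boxes(extracted_value, ocr_words, threshold=0.85):
    """Slide a window over the OCR words and return the boxes of the best fuzzy match."""
    tokens = extracted_value.split()
    best_score, best_boxes = 0.0, None
    for i in range(len(ocr_words) - len(tokens) + 1):
        window = ocr_words[i:i + len(tokens)]
        candidate = " ".join(w["text"] for w in window)
        score = SequenceMatcher(None, extracted_value.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_score, best_boxes = score, [w["bbox"] for w in window]
    return best_boxes if best_score >= threshold else None

# ocr_words is assumed to look like: [{"text": "Acme", "bbox": (x0, y0, x1, y1)}, ...]
```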
Demo: app.github.ai (just register an account and try) Github: https://github.com/analytiq-hub/doc-router
Reach out to me at andrei@analytiqhub.com for questions. Am looking for feedback and collaborators.
So why should I still use Extend instead of Gemini?
How do you handle the privacy of the scanned documents?
docrouter.ai can be installed on-prem. If you use the SaaS version, users collaborate in separate workspaces, modeled on how Databricks supports workspaces. The back-end DB is Mongo, which keeps things simple.
One level of privacy is the workspace-level separation in Mongo. But if there is customer interest, other setups are possible. E.g. the way Databricks handles privacy is by giving each account its own back-end services and scoping workspaces within an account.
That is a good possible model.
We work with fortune 500s in sensitive industries (healthcare, fintech, etc). Our policies are:
- data is never shared between customers
- data never gets used for training
- we also configure data retention policies to auto-purge after a time period
> After trial and error with different models
As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in gemini for half a year, and every single week the results were slightly different.
To note, the docs are bilingual, which could affect the results, but what struck me is the lack of consistency: even with the same model, running it two or three times in a row gives different results.
That's fine for my usage, but it sounds like a nightmare if, every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.
And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
Consider turning down the temperature in the configuration? LLMs have a bit of randomness in them.
Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/
> and every single week the results were slightly different.
This is one of the reasons why open source offline models will always be part of the solution, if not the whole solution.
Inconsistency comes from scaling: if you were optimizing your own infra to be cost-effective, you would arrive at the same tradeoffs. Not saying it isn't nice to be able to make some of those decisions yourself, but if you're picking LLMs for simplicity, we are years away from running your own being in the same league for most people.
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.
Quantized floating point math can, under certain scenarios, be non-associative.
When you combine that fact with being part of a diverse batch of requests over an MoE model, outputs are non-deterministic.
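The quantized case is more involved, but the underlying issue, that floating-point addition is not associative, is easy to demonstrate in plain Python:

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6
```

Change the order in which partial sums are combined (which batching across an MoE model effectively does) and the low-order bits of the logits change, which can flip an argmax even at temperature zero.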
That's why you have the Azure OpenAI APIs, which give a lot more consistency.
Wait, isn't there at least a two-step process here: semantic segmentation first, followed by something like Textract for the text, to avoid hallucinations?
One cannot possibly claim that "text extracted by a multimodal model cannot hallucinate", can one?
> accuracy was like 96% of that of the vendor and price was significantly cheaper.
I would like to know how this 96% was tested. If you use a human to do random-sample-based testing, how do you adjust the sample for an uneven distribution of errors, e.g. where a small set of documents accounts for 90% of the errors yet makes up only 1% of the docs?
One thing people always forget about traditional OCR providers (azure, tesseract, aws textract, etc.) is that they're ~85% accurate.
They are all probabilistic. You literally get back characters + confidence intervals. So when textract gives you back incorrect characters, is that a hallucination?
I'm the founder of https://doctly.ai, also pdf extraction.
The hallucination in LLM extraction is much more subtle as it will rewrite full sentences sometimes. It is much harder to spot when reading the document and sounds very plausible.
We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence. That way you have the option of trading compute and cost for accuracy.
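Not Doctly's actual implementation, just a sketch of the consensus idea (the `extract_with` helper is hypothetical, standing in for whichever client call each model needs):

```python
def extract_with(model_name: str, pdf_path: str) -> dict:
    """Hypothetical wrapper: run one model's extraction and return the parsed JSON."""
    raise NotImplementedError

def extract_with_consensus(pdf_path: str) -> tuple[dict, bool]:
    a = extract_with("model-a", pdf_path)
    b = extract_with("model-b", pdf_path)
    if a == b:
        return a, True                      # two models agree: accept
    c = extract_with("model-c", pdf_path)   # tie-breaker, trading cost for accuracy
    if c in (a, b):
        return c, True                      # majority vote
    return a, False                         # no agreement: flag for human review
```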
It’s a question of scale. When a traditional OCR system makes an error, it’s confined to a relatively small part of the overall text. (Think of “Plastics” becoming “PIastics”.) When a LLM hallucinates, there is no limit to how much text can be made up. Entire sentences can be rewritten because the model thinks they’re more plausible than the sentences that were actually printed. And because the bias is always toward plausibility, it’s an especially insidious problem.
The difference is the kind of hallucinations you get.
Traditional OCR is more likely to skip characters, or replace them with similar-looking ones, so you often get AL or A1 instead of AI, for example. In other words, traditional spelling mistakes. LLMs can do anything from hallucinating new paragraphs to slightly changing the meaning of a sentence. The text is still grammatically correct and makes sense in context, except that it's not what the document actually said.
I once gave it a hand-written list of words and their definitions and asked it to turn that into flashcards (a json array with "word" and "definition"). Traditional OCR struggled with this text, the results were extremely low-quality, badly formatted but still somewhat understandable. The few LLMs I've tried either straight up refused to do it, or gave me the correct list of words, but entirely hallucinated the definitions.
> You literally get back characters + confidence intervals.
Oh god, I wish speech to text engines would colour code the whole thing like a heat map to focus your attention to review where it may have over-enthusiastically guessed at what was said.
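If the engine exposes per-word confidences, that review view is a few lines of code; a toy terminal version (the word/confidence pairs are made up):

```python
def colorize(words_with_conf):
    """Print low-confidence words in red, medium in yellow, the rest unstyled."""
    out = []
    for word, conf in words_with_conf:
        if conf < 0.6:
            out.append(f"\033[31m{word}\033[0m")   # red: almost certainly needs review
        elif conf < 0.85:
            out.append(f"\033[33m{word}\033[0m")   # yellow: worth a glance
        else:
            out.append(word)
    print(" ".join(out))

colorize([("Please", 0.98), ("remit", 0.97), ("to", 0.99), ("IIC", 0.42), ("Holdings", 0.91)])
```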
You no knot.
I know nothing about OCR providers. It seems like OCR failure would result in gibberish or awkward wording that might be easy to spot. Doesn't the LLM failure mode assert made up truths eloquently that are more difficult to spot?
> is that they're ~85% accurate.
Speaking from experience, you need to double-check "I", "l", "1", "0" and "O" all the time; accuracy seems to depend on the font and some other factors.
I have a util script I use locally to copy some token values out of screenshots from a VMWare client (long story) and I have to manually adjust the output 9/10 times.
How relevant that is or isn't depends on the use case.
For an OCR company I imagine it would be unconscionable, because if you're OCR'ing, say, an oral history project for a library and you make hallucination errors, well, you've replaced facts with fiction. Rewriting history? What the actual F.
Probably totally fine for a "fintech" (Crypto?) though. Most of them are just burning VC money anyway. Maybe a lucky customer gets a windfall because Gemini added some zeros.
Normal OCR (like Tesseract) can be wrong as well (and IMO this happens frequently). It won't hallucinate/straight-up make shit up like an LLM, but a human needs to review OCR results if the workload requires accuracy. Even across multiple runs of the same image an OCR engine can give different results (in some scenarios). No OCR system is perfectly accurate; they all use some kind of machine learning/floating point/potentially nondeterministic tech.
Can confirm: using Gemini, some figure numbers were hallucinated. I had to cross-check each row to make sure the extracted data was correct.
Use different models to extract the page and cross-check them against each other; that generally reduces issues a lot.
Wouldn't the temperature on something like OCR be very low? You want the same result every time. Isn't some part of hallucination down to the randomness introduced by temperature?
I can imagine reducing temp too much will lead to garbage results in situations where glyphs are unreadable.
The LLMs are near perfect (maybe parsing I instead of 1). If you're using the outputs in the context of RAG, your errors are likely much, much higher in the other parts of your system. Spending a ton of time and money chasing 9s when 99% of your system's errors have totally different root causes seems like a bad use of time (unless they're not).
This sounds extremely like my old tax accounting job. OCR existed and "worked" but it was faster to just enter the numbers manually than fix all the errors.
Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.
Germany (not exactly the cradle of digitalization) already auto-fills salary tax fields with data from the employer.
They finally made filing free.
So, maybe this century?
Check again, Elon and his Doge team killed that.
This is a big aha moment for me.
If Gemini can do semantic chunking at the same time as extraction, all for so cheap and with nearly perfect accuracy, and without brittle prompting incantation magic, this is huge.
Could it do exactly the same with a web page? Would this replace something like beautiful soup?
I don't know exactly how or what it's doing behind the scenes, but I've been massively impressed with the results Gemini's Deep Research mode has generated, including both traditional LLM freeform & bulleted output, but also tabular data that had to come from somewhere. I haven't tried cross-checking for accuracy but the reports do come with linked sources; my current estimation is that they're at least as good as a typical analyst at a consulting firm would create as a first draft.
If I used Gemini 2.0 for extraction and chunking to feed into a RAG that I maintain on my local network, then what sort of locally-hosted LLM would I need to gain meaningful insights from my knowledge base? Would a 13B parameter model be sufficient?
Your local model has little more to do but stitch the already meaningful pieces together.
The pre-step, chunking and semantic understanding is all that counts.
Do you get meaningful insights with current RAG solutions?
Small point but is it doing semantic chunking, or loading the entire pdf into context? I've heard mixed results on semantic chunking.
It loads the entire PDF into context, but then it would be my job to chunk the output for RAG, and just doing arbitrary fixed-size blocks, or breaking on sentences or paragraphs is not ideal.
So I can ask Gemini to return chunks of variable size, where each chunk is one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.
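A sketch of how that request might look (the prompt wording and chunk fields here are my own, not the parent's):

```python
chunking_prompt = """
OCR this PDF and split the text into chunks for retrieval.
Each chunk must contain exactly one self-contained idea or concept; never split a
logical section across chunks. Return JSON matching the provided schema.
"""

chunk_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "heading": {"type": "string"},   # nearest section heading, if any
            "summary": {"type": "string"},   # one-line gist, handy as retrieval metadata
            "text": {"type": "string"},      # the verbatim chunk content
        },
        "required": ["text"],
    },
}
```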
It's cheap now because Google is subsidizing it, no?
Spoiler: every model is deeply, deeply subsidized. At least Google's is subsidized by a real business with revenue, not VC's staring at the clock.
This is great. I just want to highlight how nuts it is that we have spun up whole industries around extracting text that was typically printed from a computer, back into a computer.
There should be laws that mandate that financial information be provided in a sensible format: even Office Open XML would be better than this insanity. Then we can redirect all this wasted effort into digging ditches and filling them back in again.
I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
This is giving me hope that it's possible.
(from the gemini team) we're working on it! semantic chunking & extraction will definitely be possible in the coming months.
>>I've been fighting trying to chunk SEC filings properly, specifically surrounding the strange and inconsistent tabular formats present in company filings.
For this specific use case you can also try edgartools[1], a relatively recently released library that ingests SEC submissions and filings. It doesn't use OCR but (from what I can tell) directly parses the XBRL documents submitted by companies and stored in EDGAR, where they exist.
[1] https://github.com/dgunning/edgartools
I'll definitely be looking into this, thanks for the recommendation! Been playing around with it this afternoon and it's very promising.
If you'd kindly tl;dr the chunking strategies you have tried and what works best, I'd love to hear.
isn't everyone on iXBRL now? Or are you struggling with historical filings?
XBRL is what I'm using currently, but it's still kind of a mess (maybe I'm just bad at it) for some of the non-standard information that isn't properly tagged.
How do today’s LLM’s like Gemini compare with the Document Understanding services google/aws/azure have offered for a few years, particularly when dealing with known forms? I think Google’s is Document AI.
I've found the highest accuracy solution is to OCR with one of the dedicated models then feed that text and the original image into an LLM with a prompt like:
"Correct errors in this OCR transcription".
How does it behave if the body of text is offensive or what if it is talking about a recipe to purify UF-6 gas at home? Will it stop doing what it is doing and enter lecturing mode?
I am asking not to be cynical, but because in my limited experience, using LLMs for any task that may operate on offensive or unknown input seems to get them triggered into all sorts of unpredictable moral judgements and dragged into generating anything but the output I wanted.
If I ask this black box to give me JSON output containing keywords for a certain text, and the text happens to be offensive, it refuses to do it.
How does one tackle that?
This is what we do today. Have you tried it against Gemini 2.0?
member of the gemini team here -- personally, i'd recommend directly using gemini vs the document understanding services for OCR & general docs understanding tasks. From our internal evals gemini is now stronger than these solutions and is only going to get much better (higher precision, lower hallucination rates) from here.
Could we connect offline about using Gemini instead of the doc ai custom extractor we currently use in production?
This sounds amazing & I'd love your input on our specific use case.
joelatoutboundin.com
GCP's Document AI service is now literally just a UI layer for document-parsing use cases, backed by Gemini models. When we realized that, we dumped it and just use Gemini directly.
Your OCR vendor would be smart to replace their own system with Gemini.
They will, and they'll still have a solid product to sell, because their value proposition isn't accurate OCR per se, but putting an SLA on it.
Reaching reliability with LLM OCR might involve some combination of multiple LLMs (and keeping track of how they change), perhaps mixed with old-school algorithms, and random sample reviews by humans. They can tune this pipeline however they need at their leisure to eke out extra accuracy, and then put written guarantees on top, and still be cheaper for you long-term.
With “Next generation, extremely sophisticated AI”, to be precise, I dare say. ;)
Marketing joke aside, maybe a hybrid approach could serve the vendor well: best of both worlds if it reaps benefits. Or they could even have a look at Hugging Face for more specialized, aka better, LLMs.
I work in financial data and our customers would not accept 96% accuracy in the data points we supply. Maybe 99.96%.
For most use cases in financial services, accurate data is very important.
so, what solution are you using to extract data with 99.96% accuracy?
I'm curious to hear about your experience with this. Which solution were you using before (the one that took 12 minutes)? If it was a self-hosted solution, what hardware were you using? How does Gemini handle PDFs with an unknown schema, and how does it compare to other general PDF parsing tools like Amazon Textract or Azure Document Intelligence? In my initial test, tables and checkboxes weren't well recognized.
> For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair".
I'm actually somewhat surprised Gemini didn't guess from context that LLC is much more likely?
I guess the OCR subsystem is intentionally conservative? (Though I'm sure you could do a second step on your end, take the output from the conservative OCR pass, and sent it through Gemini and ask it to flag potential OCR problems? I bet that would flag most of them with very few false positives and false negatives.)
Where I work we've had great success at using LLMs to OCR paper documents that look like
https://static.foxnews.com/foxnews.com/content/uploads/2023/...
but were often written with typewriters long ago, and turn them into nice structured tabular output. It deals with text being split across lines and across pages just fine.
How does it compare with traditional proprietary on-premise software like OmniPage or ABBYY, or those listed here: https://en.wikipedia.org/wiki/Comparison_of_optical_characte...
It is cheaper now, but I wonder if it will continue to be cheaper when companies like Google and OpenAI decide they want to make a profit off of AI, instead of pouring billions of dollars of investment funds into it. By the time that happens, many of the specialized service providers will be out of business and Google will be free to jack up the price.
I use Claude through OpenRouter (with Aider), and was pretty amazed to see that it routes the requests during the same session almost round-robin through Amazon Bedrock, sometimes through Google Vertex, sometimes through Anthropic themselves, all of course using the same underlying model.
Literally whoever has the cheapest compute.
With the speed that AI models are improving these days, it seems like the 'moat' of a better model is only a few months before it is commoditized and goes to the cheapest provider.
What are the pdfs containing?
I've been wanting to build a system that ingests PDF reports which reference other types of data (images, CSV, etc.) that can also be ingested, to ultimately build an analytics database from the stack of unsorted data and its metadata, but I have not found any time to do anything like that yet. What kind of tooling do you use to build your data pipelines?
It's great to hear it's this good, and it makes sense since Google has had several years of experience creating document-type-specific OCR extractors as components of their Document AI product in Cloud. What's most heartening is to hear that the legwork they did for that set of solutions has made it into Gemini for consumers (and businesses).
Successful document processing vendors use LLMs already; I know this at least of Klippa. They have (apparently) fine-tuned models, prompts, etc. The biggest issue with using LLMs directly is error handling, validation, and "parameter drift"/randomness. This is the typical "I'll build it myself, but worse" thing.
I'm interested to hear what your experience has been dealing with optional data. For example, if the input PDF has fields which are sometimes not populated or nonexistent, is Gemini smart enough to leave those fields blank in the output schema? Usually the LLM tries to please you and makes up values here.
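One common mitigation, sketched below with made-up field names: make absence representable in the schema and say so in the prompt, so the model has a legal way to return "not present" instead of inventing a value.

```python
# Nullable types plus an explicit instruction give the model a valid way out.
schema = {
    "type": "object",
    "properties": {
        "fax_number": {"type": ["string", "null"]},
        "middle_name": {"type": ["string", "null"]},
    },
    "required": ["fax_number", "middle_name"],   # must appear, but may be null
}
prompt_suffix = "If a field is not present in the document, return null for it. Never guess."
```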
You could ingest them with AWS Textract and have predictability and formatting in the format of your choice. Using LLMs for this is lazy and generates unpredictable and non-deterministic results.
Did you try other vision models such as ChatGPT or Grok? I'm doing something similar but have struggled to find good comparisons between the vision models in terms of OCR and document understanding.
If the documents have the same format, maybe you could include an example document in the prompt, so the boilerplate stuff (like LLC) gets handled properly.
You could probably take this a step further and pipe the OCR'ed text into Claude 3.5 Sonnet and get it to fix any OCR errors
What if you prompt Gemini that mistaking LLC for IIC is a common mistake? Will Gemini auto correct it?
With lower temperature, it seems to work okay for me.
A _killer_ awesome thing it does too is let you specify the output structure as code in the config, instead of through repeated attempts at prompting.
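That presumably refers to putting the response schema in the generation config rather than describing it in the prompt; in the Python SDK that looks roughly like this (typed-schema support varies by SDK version, so treat the details as an assumption):

```python
import typing_extensions as typing
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

class LineItem(typing.TypedDict):
    description: str
    amount: float

model = genai.GenerativeModel("gemini-2.0-flash")
pdf = genai.upload_file("invoice.pdf")  # hypothetical input document

response = model.generate_content(
    [pdf, "Extract all line items from this invoice."],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=list[LineItem],   # the structure lives in config, not in the prompt
    ),
)
print(response.text)
```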
Just to make sure: you are talking about your experiences with Gemini 1.5 Flash here, right?
Hi! Any guesstimate for pages/minute from your Gemini OCR experience? Thanks!
So are you mostly processing PDFs with data? Or PDFs with just text, or images, graphs?
Not the parent, but we process PDFs with text, tables, diagrams. Works well if the schema is properly defined.
Is privacy a concern?
Why would it be? Their only concern is IPO.
In fintech I'd suspect the PDFs are public knowledge
What hardware are you using to run it?
The Gemini model isn't open so it does not matter what hardware you have. You might have confused Gemini with Gemma.
OK, I see, pity. I'm interested in similar applications but in contexts where the material is proprietary and might contain PII.
“LLC” to “IIC” is one thing. But wouldn’t that also make it just as easy to to mistake something like “$100” for “$700”?
Out of interest, did you parse into any sort of defined schema/structure?
Parent literally said so …
> Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.
The Gemini API has a customer noncompete, so it's not an option for AI work. What are you working on that doesn't compete with AI?
You do realize most people aren't working on AI, right?
Also, OP mentioned fintech at the outset.
what doesn't compete with ai?