Comment by jasonjmcghee
1 year ago
A big takeaway for me is that Gemini Flash 2.0 is a great solution for OCR, considering accessibility, cost, accuracy, and speed.
It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.
Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.
And in my personal experience comparing Gemini 2.0 Flash vs 2.0 Pro, it's not even close.
I had Gemini 2.0 Pro read my entire handwritten, stain-covered, half-English, half-French family cookbook perfectly on the first try.
It's _crazy_ good. I had it output the whole thing in LaTeX format to generate a printable document immediately, too.
I’m definitely not getting that takeaway. This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
VLMs are every bit as susceptible to the (unsolved) hallucination problem as regular LLMs are. I would not use them to do OCR on anything important because the failure modes are totally unbounded (unlike regular OCR).
> This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
Looks like they've got deterministic metrics to me: for each document they have a ground-truth set of extracted JSON data, and they use json-diff to identify the fields that disagree.
There is GPT-4o in their evaluation pipeline, but only as a means of converting the OCRed document into their target JSON schema.
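To make the metric concrete, here's a toy sketch of that kind of deterministic field-level comparison (my own minimal version for illustration, not their actual json-diff pipeline; the `flatten`/`field_accuracy` helpers are hypothetical names):

```python
# Toy sketch: score structured extraction by diffing leaf fields of
# ground-truth JSON against model-extracted JSON. Illustrative only —
# not the benchmark's actual json-diff implementation.

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": value} leaf paths."""
    leaves = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            leaves.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            leaves.update(flatten(v, f"{prefix}{i}."))
    else:
        leaves[prefix.rstrip(".")] = obj
    return leaves

def field_accuracy(ground_truth, extracted):
    """Fraction of ground-truth leaf fields the extraction matched exactly."""
    gt = flatten(ground_truth)
    ex = flatten(extracted)
    if not gt:
        return 1.0
    matches = sum(1 for path, value in gt.items() if ex.get(path) == value)
    return matches / len(gt)

truth = {"invoice": {"total": "42.00", "vendor": "Acme"}}
parsed = {"invoice": {"total": "42.00", "vendor": "ACME"}}
print(field_accuracy(truth, parsed))  # one of two fields matches -> 0.5
```

The point is that no LLM judge is needed for scoring: once both sides are JSON, field agreement is a plain deterministic diff.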
Also, what's strange is that no free or paid OCR engine is added to the mix for the evaluation. Tesseract is built specifically for OCR'ing scanned documents, and it has both traditional and neural-network-based modes, to boot.
> Also, what's strange is that no free or paid OCR engine is added to the mix for the evaluation.
The article says they evaluated "Traditional OCR providers (Azure, AWS Textract, Google Document AI, etc.)"
Are those not paid OCR engines?
1 reply →
I'm wondering how Gemini can OCR a big image correctly with good quality. They charge ~250 tokens for an image as input, always the same no matter the size of the image you send. 250 tokens is ~200 words. Will OCR work if you send a 4K image that has a lot of text in a small font? What if the page has more than 200 words? Is Google selling it at cost?