Comment by rvnx

9 months ago

The author claims that the most likely explanation is that Tesseract is running behind ChatGPT-4V/4o.

There is no way that this is Tesseract.

-> Tesseract's accuracy is very low; it can barely do OCR even on printed documents.

Shouldn't this theory be testable? The response time for an image of the same size should remain constant (assuming a generated response of constant size). You could then put an increasing amount of text inside the image. If this text is fed to the LLM via OCR, the total token count grows, and you should be able to observe an increase in response time.
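One rough way to run that experiment: render same-size images containing increasing amounts of text, pin the response length, and time the calls. A minimal sketch, assuming the official `openai` Python SDK and Pillow; the model name "gpt-4o", the filler text, and the image dimensions are illustrative choices, not claims about the actual pipeline.

    import base64
    import io
    import time

    from openai import OpenAI
    from PIL import Image, ImageDraw

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def make_text_image(n_words: int, size=(1024, 1024)) -> bytes:
        # Render n_words of filler text onto a fixed-size white PNG.
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        words = ["lorem"] * n_words
        for i in range(0, len(words), 15):  # naive wrap: 15 words per line
            draw.text((10, 10 + 14 * (i // 15)), " ".join(words[i:i + 15]), fill="black")
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()

    for n_words in (0, 100, 500, 1000):
        data_url = "data:image/png;base64," + base64.b64encode(make_text_image(n_words)).decode()
        start = time.monotonic()
        client.chat.completions.create(
            model="gpt-4o",
            max_tokens=1,  # pin the output length so only input handling varies
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Reply with the single word: ok"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ]}],
        )
        print(f"{n_words:5d} words -> {time.monotonic() - start:.2f}s")

If an OCR pass feeds the recognized text to the model as tokens, latency should climb with the word count; if the image is consumed as a fixed number of vision tokens, it should stay roughly flat.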

Even if Tesseract's accuracy is low, passing the Tesseract output alongside the image to the LLM can result in much more accurate OCR.

For example, GPT-4 with some vision capability would be able to repair incorrect OCR output using its understanding of word co-occurrence.

I've tested this approach with a purely text-based LLM to correct OCR mistakes, and it works quite well.

Also note that some newer OCR pipelines that don't involve LLMs pair a vision component with a text-correction model, somewhat like a spell checker, which can further improve results.
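A minimal sketch of the "OCR first, language model cleanup after" idea from the last few comments, assuming pytesseract and the `openai` SDK; the model name, file name, and prompt wording are illustrative:

    import pytesseract
    from openai import OpenAI
    from PIL import Image

    client = OpenAI()

    # First pass: raw (possibly noisy) Tesseract output.
    raw = pytesseract.image_to_string(Image.open("scan.png"))

    # Second pass: a text-only model repairs recognition errors using
    # word co-occurrence knowledge, much like the spell-check-style
    # correction stage described above.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "The following text came from a noisy OCR pass. "
                       "Fix obvious recognition errors without changing "
                       "the wording:\n\n" + raw,
        }],
    )
    print(resp.choices[0].message.content)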

  • You can tell that the OCR fails more in cases without natural language, like code or random characters. OpenAI seems to claim 4o is a fully end-to-end multimodal model, but we will never know for sure; we can't trust a single word OpenAI says.

I once uploaded a giant image to ChatGPT, asking it to transcribe the text, but the request failed, and the error message referenced a Python script related to Tesseract. Since then I'm 100% sure Tesseract is used there in some capacity for text recognition.

That's because no one knows how to prep the images. With the right file type and resolution I get under one character error per 10 pages, and it's been that good since the late 00s. (A generic baseline prep is sketched after this sub-thread.)

  • With handwriting? With mixed fonts? Tesseract requires heavy customization and extension to perform reasonably on these workloads. The off-the-shelf options from major cloud providers blow it out of the water.

    • Never had to use it with handwriting; with mixed fonts and text where location carries semantic information: absolutely.

  • How do you prep the images?

    • My hourly rate starts at $300. If you'd like to hire me, you're more than welcome to. I've done this work for a number of companies in the past.
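Since the commenter doesn't share their recipe, here is only the widely cited generic baseline: read in grayscale, upscale toward the ~300 DPI range Tesseract was tuned for, and binarize with Otsu's method. A sketch using OpenCV and pytesseract; the scale factor and file name are illustrative, and this is not a claim about the commenter's actual pipeline.

    import cv2
    import pytesseract

    img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    # Upscale small scans so glyph sizes land near what Tesseract expects.
    img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

    # Otsu's threshold yields clean black-on-white text for flat scans.
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    print(pytesseract.image_to_string(img))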

Agreed. Tesseract does not handle handwriting or distorted text well, e.g. colored text over an image background, to the point that it would hurt any downstream LLM trying to make sense of the contents. It won't even pick out bounding boxes.
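For reference, on clean input Tesseract does emit word-level boxes through pytesseract's image_to_data; the complaint above is that on distorted or stylized text those boxes come back empty or as noise. A quick way to check, with an illustrative file name:

    import pytesseract
    from PIL import Image
    from pytesseract import Output

    data = pytesseract.image_to_data(Image.open("styled_text.png"),
                                     output_type=Output.DICT)
    for text, conf, x, y, w, h in zip(data["text"], data["conf"], data["left"],
                                      data["top"], data["width"], data["height"]):
        if text.strip():  # skip empty detections
            print(f"{text!r} conf={conf} box=({x},{y},{w},{h})")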

I doubt they are running an OCR model, but if they actually were, it would likely be an in-house one trained with more modern techniques.