Comment by jerrygenser
9 months ago
Even if Tesseract's accuracy is low, passing the Tesseract output along with the image to an LLM can yield a much more accurate transcription.
For example, GPT-4 with vision can fill in the incorrect OCR using its understanding of word co-occurrence.
I've tested this approach with a text-only LLM to correct OCR mistakes, and it works quite well.
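A minimal sketch of what the prompt construction could look like. This assumes the OpenAI chat-completions message format with base64 image input; the function name and prompt wording are illustrative, not a reference to any particular implementation.

```python
import base64

def build_messages(ocr_text: str, image_bytes: bytes) -> list:
    """Combine the raw image and the (possibly noisy) Tesseract output
    into one chat prompt, so the model can cross-check text against pixels."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"role": "system",
         "content": "You correct OCR output. Use the image as ground truth."},
        {"role": "user",
         "content": [
             {"type": "text",
              "text": f"Tesseract produced this text:\n{ocr_text}\n"
                      "Return the corrected transcription only."},
             {"type": "image_url",
              "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
         ]},
    ]

# Hypothetical usage (requires the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(model="gpt-4o",
#                                        messages=build_messages(text, img))
```

For a text-only LLM, you would drop the `image_url` part and send just the OCR text with the same correction instruction.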
Also note that some newer OCR pipelines that don't involve LLMs pair a vision component with a text-correction model, which works somewhat like spell check and can further improve results.
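As a toy illustration of that spell-check-style correction stage (the real models are learned, not rule-based), here is a minimal edit-distance corrector that snaps each OCR token to the nearest word in a known vocabulary:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # min of: deletion (above), insertion (left), substitution (diag)
            prev, dp[j] = dp[j], min(dp[j] + 1,
                                     dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def correct_token(token: str, vocab: set[str], max_dist: int = 1) -> str:
    """Replace an out-of-vocabulary token with its closest vocabulary word,
    but only if the closest word is within max_dist edits."""
    if token in vocab:
        return token
    best = min(vocab, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token

def correct_ocr(text: str, vocab: set[str]) -> str:
    return " ".join(correct_token(t, vocab) for t in text.split())
```

For example, `correct_ocr("hell0 worid", {"hello", "world"})` snaps both garbled tokens back to real words. This is also exactly where such approaches break down on code or random strings: there is no vocabulary to snap to, which matches the failure mode described below.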
You can tell that OCR fails more often on input without natural-language structure, such as code or random character strings. OpenAI seems to claim that 4o is a fully end-to-end multimodal model, but we'll never know for sure; we can't trust a single word OpenAI says.