Comment by jeswin
14 days ago
If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post earlier this week, it should provide samples which fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."
> We have hundreds of examples like this queued up, so let us know if you want some more!
Link to it then, let people verify.
I've pushed a lot of financial tables through Claude, and it gives remarkably high accuracy (99%+) when the text size is legible to a mid-40s person like me. GPT-4o is far less accurate.
[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...
99%+ is terrible in the OCR world. 99.8%+ on the first pass, and 99.99%+ (one error per 10k characters) at the end of the process - which includes human reviewers in the loop - is acceptable, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.
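To make those figures concrete, character-level accuracy is usually measured as 1 minus the character error rate (edit distance divided by reference length). A minimal sketch in pure Python, with made-up strings for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(reference: str, ocr_output: str) -> float:
    # Accuracy = 1 - CER, where CER = edit distance / reference length.
    return 1.0 - levenshtein(reference, ocr_output) / len(reference)

# One wrong character on a 1000-character page is 99.9% accuracy;
# "99%+" allows up to ten errors on that same page.
page = "a" * 1000
print(char_accuracy(page, "a" * 999 + "b"))
```

At 99% a dense financial table can carry several wrong digits per page, which is why OCR vendors quote the extra nines.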
Even with the best OCR, and high resolution scans, you might not get this due to:
- the quality of the original paper documents, and
- the language
I have non-English documents for which I'd love to have 99% accuracy!
Language is often solvable with better dictionaries. I have been forced to build my own dictionaries in the past, which brought error rates in line with more mainstream languages like English. If you are talking about a different script, like Cyrillic or Arabic, that is another problem.
1 reply →
Ha hi Jeswin! I was itching to reply to this post too, I wonder why…
Dave! Our sample sizes were large enough, and tables complex enough to opine on this.
I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.
Using that image and the following prompt on Gemini 2.0 Flash, "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible", outputs something that isn't bad, but it's not perfect:
The first column is offset vertically which mixes up information and is wrong.
I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:
(edit: line wrap messes it all up... still I think my version is better ;-)
I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.
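A sketch of what that second pass might gate on: cheap structural checks over the hierarchical JSON before spending another model call. The JSON shape and field names here are my own illustration, not from the thread:

```python
import json

# Hypothetical OCR output in the hierarchical shape we prompt for:
# a document node containing tables, each table a list of rows,
# each row mapping column headers to cell values.
raw = '''
{
  "document": "Q3 statement",
  "tables": [
    {"title": "Revenue", "rows": [{"Region": "EMEA", "Amount": "1,204"}]}
  ]
}
'''

def validate(doc: dict) -> list[str]:
    # Flag empty tables and rows whose keys drift from the first
    # row's headers - a common symptom of a misaligned column.
    problems = []
    for table in doc.get("tables", []):
        if not table.get("rows"):
            problems.append(f"empty table: {table.get('title')}")
            continue
        headers = set(table["rows"][0])
        for i, row in enumerate(table["rows"][1:], 2):
            if set(row) != headers:
                problems.append(f"{table['title']}: row {i} header mismatch")
    return problems

doc = json.loads(raw)
print(validate(doc))  # an empty list means no structural problems found
```

Anything flagged here can be routed to the second model (or a human) for correction, instead of re-checking every page.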
Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.