Comment by bambax
14 days ago
Feeding that image to Gemini 2.0 Flash with the prompt "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible" gives output that isn't bad, but isn't perfect either:
PI tno Name Time 3.5 km 18 C (cont.)
MEN B (39) 3(34) 4(52) 5(53) 6(54) 7(55)
8(40) 9(57)
12(60) 13(61) 14(62) 15(63) 16(47)
17(48) 18(100)
1(51) 2(33)
10(58) 11(59)
The first column is offset vertically, which mixes up the information and makes the result wrong.
I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:
PI tno Name Time
MEN B (39) 3.5 km 18 C (cont.)
1 (51) 2 (33) 3 (34) 4 (52) 5 (53) 6 (54) 7 (55) 8 (40) 9 (57)
10 (58) 11 (59) 12 (60) 13 (61) 14 (62) 15 (63) 16 (47) 17 (48) 18 (100)
Finish
13 425 Peter Hodkinson 11:40 0:48 +0: 06 (21) 1:29 +0: 13 (28) 1:58 +0: 13 (24) 2:44 +0: 18 (23) 3:38 +0: 20 (19) 4:28 +0: 22 (18) 5:05 +0: 23 (17) 5:36 +0: 26 (17) 6:19 +0: 29 (19)
Great Britain 0:48 +0: 06 (21) 0:41 +0: 09 (30) 0:29 +0: 01 (4) 0:46 +0: 07 (22) 0:54 +0: 02 (5) 0:50 +0: 03 (7) 0:37 +0: 02 (10) 0:31 +0: 03 (11) 0:43 +0: 05 (20)
6:47 +0: 28 (17) 7:02 +0: 29 (17) 8:21 +0: 38 (16) 8:41 +0: 39 (16) 9:00 +0: 41 (16) 9:13 +0: 42 (16) 9:43 +0: 42 (16) 10:36 +0: 43 (14) 11:32 +0: 41 (13)
0:28 +0: 02 (8) 0:15 +0: 01 (4) 1:19 +0: 11 (16) 0:20 +0: 03 (15) 0:19 +0: 02 (4) 0:13 +0: 02 (11) 0:30 +0: 01 (2) 0:53 +0: 01 (3) 0:56 0:00 (1)
11:40 +0: 40 (13)
0:08 +0: 00 (8)
(edit: line wrap messes it all up... still I think my version is better ;-)
I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.
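To make the "hierarchical json" idea concrete, here's a rough sketch of what such a structure could look like for the result sheet quoted above. This is plain regex post-processing in Python, not a model call and not anyone's actual pipeline. The function names and JSON keys are made up for illustration. It assumes that tokens like "1 (51)" are a control number and a control code, that each "0:48 +0: 06 (21)" group is a time, a time behind and a rank, and that the first row of splits is cumulative while the second is per leg:

```python
import json
import re

# Assumption: header tokens such as "1 (51)" or "1(51)" mean
# (control number, control code).
CONTROL_RE = re.compile(r"(\d+)\s*\((\d+)\)")

# Assumption: each split is "time [+]behind (rank)", e.g. "0:48 +0: 06 (21)".
# The "+" is optional and OCR may leave a stray space after the colon.
SPLIT_RE = re.compile(r"(\d+:\d{2})\s+\+?(\d+):\s*(\d{2})\s+\((\d+)\)")


def parse_controls(line):
    """Return one dict per "N (code)" token found in a header line."""
    return [{"control": int(n), "code": int(c)} for n, c in CONTROL_RE.findall(line)]


def parse_splits(line):
    """Return one dict per split group, with the stray OCR space removed."""
    return [
        {"time": t, "behind": f"{m}:{s}", "rank": int(r)}
        for t, m, s, r in SPLIT_RE.findall(line)
    ]


def to_hierarchy(header_line, cumulative_line, leg_line):
    """Nest cumulative and per-leg splits under their control (hypothetical layout)."""
    controls = parse_controls(header_line)
    cumulative = parse_splits(cumulative_line)
    legs = parse_splits(leg_line)
    return {
        "controls": [
            {**c, "cumulative": cum, "leg": leg}
            for c, cum, leg in zip(controls, cumulative, legs)
        ]
    }


if __name__ == "__main__":
    header = "1 (51) 2 (33) 3 (34)"
    cumulative = "13 425 Peter Hodkinson 11:40 0:48 +0: 06 (21) 1:29 +0: 13 (28) 1:58 +0: 13 (24)"
    leg = "Great Britain 0:48 +0: 06 (21) 0:41 +0: 09 (30) 0:29 +0: 01 (4)"
    print(json.dumps(to_hierarchy(header, cumulative, leg), indent=2))
```

A nested shape like this also makes it easier to sanity-check the OCR afterwards, for example by verifying that each cumulative time equals the previous one plus the leg time.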
Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.