Comment by daemonologist
9 months ago
Has anyone tried Kosmos-2.5 [0]? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test yet.
[0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5
Okay, I got Kosmos-2.5 running - here's a mini review:
It's _extremely_ slow, about 30 seconds/page on an A10G. Maybe there's room to improve that, I don't know, but for now that's a problem.
The actual character recognition is superb, probably comparable to the big cloud offerings. On a sample of pages of moderately challenging typed text it was literally flawless aside from non-ASCII characters.
It can do _neat_ handwriting with reasonable accuracy. This surprised me since there doesn't seem to be any handwriting in the training data (or in Pix2Struct's, either). However, it will sometimes just skip handwriting.
The structured (markdown) output is sometimes impressive, occasionally a mess. I noticed two weaknesses in particular: it often starts a table as an HTML table and then switches to markdown, and it struggles to distinguish multi-column layouts unless they're pure book-like paragraphs or a table with clear gridlines. This is probably a result of sane/straightforward layouts from READMEs and scientific papers being most represented in the training data (the industry I'm in produces lots of documents with layouts that are wild even to a human).
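For anyone who wants to catch the HTML-then-markdown table switch automatically, here is a rough sketch of a check. The function name and regexes are hypothetical, not part of Kosmos-2.5, and it assumes the model's output is available as a plain string. It only looks for both syntaxes appearing in the same output, so a document that legitimately contains one table of each kind would also be flagged.

```python
import re

# Assumption: HTML tables show up as <table>/<tr>/<td>/<th> tags, and
# markdown table rows are lines that start and end with a pipe.
HTML_TABLE_TAG = re.compile(r"</?(table|tr|td|th)\b", re.IGNORECASE)
MD_TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")


def has_mixed_table_syntax(output: str) -> bool:
    """Return True if the output contains both HTML and markdown table markup."""
    saw_html = False
    saw_md = False
    for line in output.splitlines():
        if HTML_TABLE_TAG.search(line):
            saw_html = True
        elif MD_TABLE_ROW.match(line):
            saw_md = True
    return saw_html and saw_md


if __name__ == "__main__":
    sample = "<table>\n<tr><td>Item</td><td>Qty</td></tr>\n| Bolt | 4 |\n| Nut | 8 |"
    print(has_mixed_table_syntax(sample))
```

Pages that trip this check could be queued for manual review or re-run rather than trusted as-is.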
One other thing: as a generative model it can and will go off the rails. One document I gave it with a lot of messy handwriting produced the typed header and then just 1500 lines of greater-than symbols. To be fair I couldn't read it either. While I didn't see it produce any valid-looking but "hallucinated" output, that's a possibility too.
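The runaway case (a header followed by 1500 lines of the same symbol) is at least cheap to detect after the fact. A minimal sketch, assuming the output is a plain string; the function name and both thresholds are made up for illustration and would need tuning on real documents. It will not catch plausible-looking hallucinated text, only output that has collapsed into repetition.

```python
def looks_degenerate(text: str, min_lines: int = 20, max_unique_ratio: float = 0.1) -> bool:
    """Heuristically flag output that has collapsed into repeated lines.

    min_lines and max_unique_ratio are illustrative defaults, not tuned values.
    """
    # Ignore blank lines so padding does not skew the ratio.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        # Too short to judge; short outputs are never flagged.
        return False
    # Ratio of distinct lines to total lines; near 0 means heavy repetition.
    unique_ratio = len(set(lines)) / len(lines)
    return unique_ratio <= max_unique_ratio


if __name__ == "__main__":
    runaway = "INSPECTION REPORT\n" + ">\n" * 1500
    print(looks_degenerate(runaway))
```

A long table with many identical rows could trip this too, so it is better treated as a "needs a second look" signal than as a hard failure.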
It works really well for captioning. The few attempts I made at OCR failed miserably on CCTV images (camera label at top and datetime stamp on bottom).