Comment by vikp

5 months ago

Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.

I integrated gemini recently to improve accuracy in certain blocks like tables. (get initial text, then pass to gemini to refine) Marker alone works about as well as gemini alone, but together they benchmark much better.

5 comments

vikp

llm_trw 5 months ago

I used sxml [0] unironically in this project extensively.

The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.

[0] https://en.wikipedia.org/wiki/SXML

cma 5 months ago

Why process separately, if there are ink smudges, photocopier glitches, etc. wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?

hackernewds 5 months ago

It's funny you astroturf your own project in a thread where another is presenting tangential info about their own

alemos 5 months ago

what does marker add on top of docling?

vikp 5 months ago

Docling is a great project, happy to see more people building in the space.

Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:

  - We have hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
  - we run an ordering model, so ordering is better for docs where the PDF orde ris bad
  - OCR is a lot better, we train our own model, surya - https://github.com/VikParuchuri/surya
  - References and links
  - Better equation conversion (soon including inline)