Comment by alok-g

1 year ago

Looking at the sample documents, this seems focused on tables and structured data extraction rather than long-form text. The ground-truth JSON contains far less information than the original document image. I would love to see a similar benchmark covering full contents, including long-form text alongside tables.

Indeed, from their conclusions:

> They [VLMs] are generally more capable of "looking past the noise" of scan lines, creases, watermarks. Traditional models tend to outperform on high-density pages (textbooks, research papers) as well as common document formats like tax forms.

Which is a bit confusing. Did they actually test that? It doesn't seem possible given their limited dataset.