Comment by alok-g
1 year ago
Looking at the sample documents, this benchmark seems focused on tables and structured data extraction rather than long-form text. The ground-truth JSON contains far less information than the original document image. I would love to see a similar benchmark covering full contents, including long-form text alongside tables.
Indeed, from their conclusions:
> They [VLMs] are generally more capable of "looking past the noise" of scan lines, creases, watermarks. Traditional models tend to outperform on high-density pages (textbooks, research papers) as well as common document formats like tax forms.
This is a bit confusing. Did they actually test that claim? It doesn't seem possible given their limited dataset.