Comment by ritvikpandey21

1 year ago

on raw text, LLMs usually don’t struggle. however, when you start processing low-fidelity images (receipt scans with stains, documents with marks all over them, bent corners or areas, rotated docs), these transcription issues become extremely noticeable. to your point about table extraction, i disagree: we’ve had many examples of complex nested tables where the model hallucinated digits, especially in documents with weird aspect ratios.

fully agree on the last point: the ViT architecture will need some work for this. Microsoft’s been doing some excellent research on this lately.
