Comment by nilirl
1 day ago
One thing I've struggled with before is building a collection of data models based off of a collection of PDF forms.
I wanted to abstract away the PDF form by building my own HTML form on top of a data model that could later be used to fill the PDF programmatically.
Since I had hundreds of PDFs, I wanted an OCR + LLM pipeline to build a data model for each one. Unfortunately, OCR + LLM works only ~90% of the time; in the remaining cases, fields are missed or mislabeled in the data model.
Does this sometimes get it wrong during programmatic filling? How do you deal with that?
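To make the "missed or mislabeled" failure mode concrete, here is a rough sketch of the kind of check I mean: compare the field names the pipeline extracted against the field names the PDF actually exposes, and flag any form where the two disagree. This assumes the PDFs are fillable forms whose field names can be read directly (for example with pypdf's `PdfReader.get_fields()`); the names `FieldDiff`, `diff_fields`, and the sample field lists are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FieldDiff:
    missing: set[str]     # present in the PDF form, absent from the extracted model
    unexpected: set[str]  # present in the extracted model, absent from the PDF form


def diff_fields(pdf_field_names, model_field_names):
    """Compare the PDF's real field names with the names in the extracted model."""
    pdf = set(pdf_field_names)
    model = set(model_field_names)
    return FieldDiff(missing=pdf - model, unexpected=model - pdf)


# Hypothetical data: what the form exposes vs. what OCR + LLM produced.
pdf_fields = ["first_name", "last_name", "dob", "ssn"]
model_fields = ["first_name", "last_name", "date_of_birth"]

diff = diff_fields(pdf_fields, model_fields)

# Any disagreement sends the form to manual review instead of automatic filling.
needs_review = bool(diff.missing or diff.unexpected)
print(sorted(diff.missing), sorted(diff.unexpected), needs_review)
```

A check like this only catches name-level mismatches. A field that is extracted under the right name but with the wrong meaning would still slip through, and flat scanned forms with no embedded fields would need a different approach.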