Comment by nilirl

1 day ago

One thing I've struggled with before is building a collection of data models based off of a collection of PDF forms.

I wanted to abstract away the PDF form building my own html form on top of a data model that can later be used to programmatically fill the PDF .

Since I had 100s of PDFs, I wanted an OCR+LLM pipeline to build a data model for each PDF. Unfortunately, OCR + LLM works ~90% of the time but sometimes fields are missed or mislabeled in the data model.

Does this sometimes get it wrong during programmatic filling? How do you deal with that?