Comment by lisa_coicadan
7 months ago
Great thread. We’ve seen the exact same pain points when working with large volumes of complex PDFs and Word docs.
At Retab.com, we focus on the “hard pre-RAG” layer: turning raw documents (including scanned reports, OCR messes, financial statements, and regulatory filings) into clean, structured, model-ready data.
Instead of relying on embeddings over noisy text chunks, we use schema-driven generation, multi-LLM consensus, and an evaluation UI to ensure output is accurate, complete, and explainable. No manual parsing, no hallucinations, just structured JSON (or any format you want), ready for retrieval, agents, or analytics.
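For anyone wondering what field-level consensus over a schema can look like in principle, here is a minimal Python sketch. It is not Retab's implementation or API: the function name `field_consensus`, the `min_agreement` threshold, and the sample dicts (standing in for the outputs of three different models) are all assumptions made up for illustration.

```python
import json
from collections import Counter


def field_consensus(extractions, schema_fields, min_agreement=0.5):
    """Field-wise majority vote over candidate extractions (hypothetical helper).

    Returns (result, flagged): the agreed value per field, plus the list of
    fields where no value won a strict majority and a human should look.
    """
    result, flagged = {}, []
    for field in schema_fields:
        # Serialize each value so nested lists/dicts can be counted as votes.
        votes = Counter(
            json.dumps(e.get(field), sort_keys=True) for e in extractions
        )
        value, count = votes.most_common(1)[0]
        if count / len(extractions) > min_agreement:
            result[field] = json.loads(value)
        else:
            result[field] = None  # no consensus, leave empty rather than guess
            flagged.append(field)
    return result, flagged


# Made-up candidate outputs, as if three models had filled the same schema.
candidates = [
    {"company": "Acme", "revenue": 1200, "currency": "USD"},
    {"company": "Acme", "revenue": 1200, "currency": "EUR"},
    {"company": "Acme", "revenue": 1250, "currency": "GBP"},
]
fields = ["company", "revenue", "currency"]

result, flagged = field_consensus(candidates, fields)
print(result, flagged)
```

The point of the design is that disagreement between models becomes a signal: fields without a majority are surfaced for review instead of being silently filled in.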
We work with teams doing RAG on contracts, audits, earnings reports, etc., anywhere that “close enough” isn’t good enough. Happy to run your hardest docs through Retab if you want to benchmark against WFGY or LlamaParse.
What makes a PDF 'hard' in your mind?