Comment by jibuai
17 days ago
I've been working on something similar the past couple months. A few thoughts:
- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism for the best accuracy.
- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Outputs are also much worse when you use too much context.
- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
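A rough sketch of the sliding-window idea in Python (the `window`/`stride` values and page list are illustrative, not from any specific pipeline): each window overlaps the next by a page, so a chunk boundary that straddles two pages is always seen whole in at least one window.

```python
# Sliding windows over per-page text so cross-page chunk boundaries
# are visible to the model. `pages` is a hypothetical list of page texts.

def sliding_windows(pages, window=3, stride=2):
    """Yield (start_page_index, joined_text) for overlapping page windows."""
    i = 0
    while i < len(pages):
        yield i, "\n".join(pages[i:i + window])
        if i + window >= len(pages):
            break  # last window already reaches the final page
        i += stride

pages = [f"page {n} text" for n in range(5)]
windows = list(sliding_windows(pages))
# windows covers pages [0..2] and [2..4]; page 2 appears in both,
# so a chunk spanning the page 2/3 boundary is intact in the second window
```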
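And a minimal sketch of the box-linking step, assuming the OCR stage emits boxes keyed by ID and the model is prompted to tag spans with `[b<n>]` references (the IDs, bbox format, and reference syntax are all assumptions for illustration, not a particular tool's API):

```python
import re

# Hypothetical output of a traditional OCR / layout-recognition pass:
# box IDs mapped to recognized text and (x0, y0, x1, y1) coordinates.
boxes = {
    "b1": {"text": "Invoice #1234", "bbox": (40, 20, 300, 48)},
    "b2": {"text": "Total: $99.00", "bbox": (40, 600, 280, 628)},
}

def resolve_links(llm_output):
    """Map '[b1]'-style references in model output back to bounding boxes."""
    linked = []
    for box_id in re.findall(r"\[(b\d+)\]", llm_output):
        if box_id in boxes:  # ignore hallucinated IDs
            linked.append((box_id, boxes[box_id]["bbox"]))
    return linked

out = resolve_links("The total appears in [b2]; the header is [b1].")
# out == [("b2", (40, 600, 280, 628)), ("b1", (40, 20, 300, 48))]
```

The custom-sampler point is about making step two reliable: constraining decoding so the model can only emit IDs that actually exist, which hosted APIs generally don't let you do.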