Comment by ritvikpandey21

1 year ago

ritvik here from pulse. everyone’s pretty much made the right points here, but i wanted to emphasize that because of the LLM architecture, these models predict “the most probable text string” that corresponds to the image embedding, not necessarily the exact text on the page. this non-determinism is awful for customers deploying in production, and a lot of our customers complained to us about it initially. the best approach is to build a sort of “agent”-based hybrid that pairs a VLM with traditional layout segmentation and reading-order algorithms, which is what we’ve done and are continuing to do.
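for illustration only, here is a minimal sketch of the "traditional" half of such a hybrid: a deterministic reading-order pass over layout boxes, with the model called once per region. this is an assumption about how such a pipeline could look, not pulse's actual implementation. `Box`, `reading_order`, `extract`, `row_tol`, and the `transcribe_region` callback are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    """A layout region in page coordinates (origin at top-left)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def reading_order(boxes: List[Box], row_tol: float = 0.5) -> List[Box]:
    """Group boxes into rows, then read rows top-to-bottom, left-to-right.

    A box joins an existing row when its top edge is within
    row_tol * (smaller of the two heights) of that row's first box.
    This is a simple heuristic and ignores multi-column layouts.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        for row in rows:
            ref = row[0]
            if abs(box.y - ref.y) <= row_tol * min(box.h, ref.h):
                row.append(box)
                break
        else:
            # No existing row is close enough, so start a new one.
            rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b.x)]


def extract(boxes: List[Box], transcribe_region: Callable[[Box], str]) -> List[str]:
    """Transcribe each region separately, in deterministic reading order.

    transcribe_region is a stand-in for whatever model call (e.g. a VLM)
    turns one cropped region into text.
    """
    return [transcribe_region(b) for b in reading_order(boxes)]
```

the point of the split is that region ordering comes from plain geometry, so it is repeatable across runs, and the model only has to transcribe one small region at a time.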

we have a technical blog post on this exact phenomenon coming out in the next couple of days, will attach it here when it’s out!