Comment by cedws

5 months ago

90% accuracy +/- 10%? What could that be useful for? That's awfully low.


cedws

> accuracy is measured with the Needleman-Wunsch algorithm

> Crucially, we’ve seen very few instances where specific numerical values are actually misread. This suggests that most of Gemini’s “errors” are superficial formatting choices rather than substantive inaccuracies. We attach examples of these failure cases below [1].

> Beyond table parsing, Gemini consistently delivers near-perfect accuracy across all other facets of PDF-to-markdown conversion.

That seems fairly useful to me, no? Maybe not for mission critical applications, but for a lot of use cases, this seems to be good enough. I'm excited to try these prompts on my own later.
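
For anyone unfamiliar with the metric quoted above, here is a minimal sketch of Needleman-Wunsch global alignment in Python. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the `similarity` normalisation are my own assumptions for illustration; the thread does not say which parameters or normalisation the article actually uses.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not the article's settings.
    """
    # prev[j] = best score aligning the already-processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical normalisation: score divided by the longer length, floored at 0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, needleman_wunsch(a, b) / longest)


# Cells as tokens: one formatting difference ("1,200" vs "1200") out of four cells.
expected = ["Revenue", "1,200", "Cost", "800"]
predicted = ["Revenue", "1200", "Cost", "800"]
print(similarity(expected, predicted))
```

Because a mismatch is penalised rather than merely not rewarded, a single cell that differs only in formatting pulls the score down by more than its share of the table, which fits the point that the "errors" can be superficial.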

schainks 5 months ago

This is "good enough" for banks to use when doing due diligence. You'd be surprised how much noise is in the system with the current state of the art: algorithms, web scrapers, and entire buildings of humans in places like India.

ai-christianson 5 months ago

It's certainly pretty useful for discovery/information filtering purposes. I.e. searching for signal in the noise if you have a large dataset.

jjtheblunt 5 months ago

Due diligence of this sort?
https://en.wikipedia.org/wiki/Know_your_customer

schainks 5 months ago

No, I mean services like Bloomberg.
KYC is an API you can pay for now. Works pretty well for the price, IIRC over 10k/month or something.

raunakchowdhuri 5 months ago

I would encourage you to take a look at some of the real data here: https://huggingface.co/spaces/reducto/rd_table_bench

You'll find that most of the errors here are structural issues with the table or an inability to parse some special characters. Tables can get crazy!

serjester 5 months ago

Author here — measuring accuracy in table parsing is surprisingly challenging. Subtle, almost imperceptible differences in how a table is parsed may not affect the reader's understanding but can significantly impact benchmark performance. For all practical purposes, I'd say it's near perfect (also keep in mind the benchmark is on very challenging tables).
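
To illustrate how imperceptible differences can move a score, here is a small hypothetical sketch in Python. The `normalize` rules and the `cell_match_rate` helper are my own assumptions, not the benchmark's method; they only show that the same parsed values can score very differently under strict and lenient comparison.

```python
import re


def normalize(cell):
    """Strip markdown emphasis, surrounding whitespace, and thousands separators."""
    cell = cell.strip().strip("*_").strip()
    # Only drop commas when the whole cell looks like a number.
    if re.fullmatch(r"-?[\d,]+(\.\d+)?", cell):
        cell = cell.replace(",", "")
    return cell.lower()


def cell_match_rate(expected, predicted, norm=lambda c: c):
    """Fraction of positions where cells agree (flattened tables of equal length assumed)."""
    pairs = list(zip(expected, predicted))
    if not pairs:
        return 1.0
    return sum(norm(e) == norm(p) for e, p in pairs) / len(pairs)


expected = ["Revenue", "1,200", "Cost", "800"]
predicted = ["**Revenue**", "1200", "cost ", "800"]

strict = cell_match_rate(expected, predicted)              # exact string equality
lenient = cell_match_rate(expected, predicted, normalize)  # formatting ignored
print(strict, lenient)
```

A reader would call these two tables identical, yet exact comparison agrees on only one cell in four.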

summerlight 5 months ago

I guess 90% is for "benchmark", which is typically tailored to be challenging to parse.

mattnewton 5 months ago

Having seen some of these tables, I would guess that's probably above a layperson's score. Some are very complicated or just misleadingly structured.

MattDaEskimo 5 months ago

Switching from manual data entry to approval