Comment by ajcp
1 day ago
> put Gemini 2.5 at the top of the pack
I have come to the same conclusion, having built a workflow that has processed 10 million+ non-standardized PDFs (freight bills of lading), with running evaluations as well as checks against our initial "ground-truth" dataset of 1,000 PDFs.
Humans: ~65% accurate
Gemini 1.5: ~72% accurate
Gemini 2.0: ~88% accurate
Gemini 2.5: ~92%* accurate
*Funnily enough, we were getting a consistent 2% improvement with 2.5 over 2.0 (90% versus 88%) until, as a lark, we decided to just copy the same prompt 10x into the request. Squeezed 2% more out of that one :D
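For anyone curious what that trick looks like in practice: it's just duplicating the instruction text in the request contents before the document. A minimal sketch (the prompt text, function name, and repeat count here are illustrative, not our actual pipeline):

```python
# Sketch of the "copy the same prompt 10x" trick: repeat the extraction
# instructions in the contents list, followed by the document part.
# EXTRACTION_PROMPT and build_contents are illustrative names only.

EXTRACTION_PROMPT = (
    "Extract shipper, consignee, and freight charges from this "
    "bill of lading and return them as JSON."
)

def build_contents(pdf_part, prompt: str, repeats: int = 10) -> list:
    """Return a contents list with the prompt repeated `repeats` times,
    then the PDF part last."""
    return [prompt] * repeats + [pdf_part]

# With the google-genai SDK this list would be passed as the `contents`
# argument to client.models.generate_content(...).
contents = build_contents("<pdf part placeholder>", EXTRACTION_PROMPT)
```

Whether the gain comes from the repetition itself or just from weighting the instructions more heavily against a large document, I can't say; it was cheap to try and it stuck.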
Gemini 3.0 is rumored to drop any day now; it will be very interesting to see what score it gets on your benchmark here.
As long as the ergonomics of the SDK stay the same, that is. Honestly, migrating to a new model this far in isn't something I want to contemplate wrestling with. When we were forced off 1.5 onto 2.0, we found that our context strategy had to be completely reworked just to recover, let alone see better returns.