Comment by BugsJustFindMe

7 hours ago

Can you feed these to ChatGPT and tell me what it says they say?

https://imgur.com/a/CDU6Lgs

It gets them wrong for me, but maybe it will get them right for you. Maybe you're better at prompting or have access to a better model or something.

Eh, I was talking about OCR'ing modern English cursive handwriting, not translating medieval script written in a dead language. It seems reasonable to expect specialized models to be used for this type of work.

Still, here's the first one, via Gemini 2.0 experimental: https://i.imgur.com/HtnwfHp.png

How does the response look? Did it correctly identify the language as Old French, at least? Even if 100% made up, which I have a feeling it is, it's a more credible (not to mention creative) attempt than most non-specialists would come up with.

o1-pro, on the other hand, completely shat the bed: https://i.imgur.com/mivdjkA.png I haven't seen it fail like that in a LONG time, so good job, I guess. :) I resubmitted it by uploading the .jpg directly, and it mumbled something about a "Problem generating the response."

Second image:

Gemini 2.0 seemed to have more trouble with this one: https://i.imgur.com/oEktMP6.png

o1-pro gave another error message, but 4o did pretty well from what I can tell (agree/disagree?): https://i.imgur.com/7iR1y7U.png I thought it was interesting that it got the date wrong, as '1682' is pretty easy to make out compared to much of the text.

In summary, I think you broke o1-pro.

  • > Did it correctly identify the language as Old French, at least

    Yes! But that's the easy part. :)

    > I was talking about OCR'ing modern English cursive handwriting

    Yeah, see, I think that's a very narrow expectation. Archival paleography is substantially broader than that. I'm not saying that the tools are useless, but they're often still no better than humans applying focused care and attention.

    > o1-pro, on the other hand, completely shat the bed

    The result is absolutely hilarious though! So kudos to the model for making me laugh at least.

    > 4o did pretty well

    It is indeed pretty good and very impressive as a technological feat. The big problems I guess are:

    1) Pretty good isn't necessarily good enough.

    2) If one machine gets it right and one machine gets it wrong, can a machine reconcile them? Or must we again recruit humans?

    3) If a machine seems to get a lot right but also clearly makes important factual errors in ways where a human looks and says "how could you possibly get this part wrong, of all things?" (like the year), how much do we trust and rely on it?

    • The technique of pitting one model against another is usually pretty effective in my experience. If Gemini 2.0 Advanced and o1-pro agree on something, you can usually take it to the bank. If they don't, that's when human intervention is necessary, given the lack of additional first-rank models to query. (Edit: 1682 versus 1692 being a great example of something that a tiebreaker model could handle.)

      It seems likely that a mixture-of-models approach like this will be a good thing to formalize at some level. Using appropriately-trained models to begin with seems even more important, though, and I can't agree that this type of content is relevant when discussing straightforward OCR tasks on modern languages.
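      The reconciliation step above can be sketched in code. This is a minimal, hypothetical sketch (not any real API): the model calls are stand-in callables, and the logic is simple 2-of-3 agreement, accepting a transcription when two models concur and flagging a three-way split for human review.

      ```python
      from collections import Counter
      from typing import Callable

      def reconcile(transcribe_a: Callable[[bytes], str],
                    transcribe_b: Callable[[bytes], str],
                    tiebreak: Callable[[bytes], str],
                    image: bytes) -> tuple[str, bool]:
          """Accept when two models agree; otherwise call a third as tiebreaker.

          Returns (transcription, needs_human): needs_human is True only
          when all three models disagree and a person should review.
          """
          a, b = transcribe_a(image), transcribe_b(image)
          if a == b:
              # The two first-rank models agree: take it to the bank.
              return a, False
          # Disagreement (e.g. "1682" vs "1692"): ask a tiebreaker model.
          c = tiebreak(image)
          best, count = Counter([a, b, c]).most_common(1)[0]
          # A 2-of-3 majority resolves the tie; a three-way split does not.
          return best, count < 2
      ```

      Real transcriptions would need fuzzier comparison than exact string equality (normalization, per-line or per-word voting), but the control flow is the same.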
