Comment by pietz

3 months ago

We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree, though, that instead of saying "Sorry, I can't read that" it just made something up.

My thought is that whilst LLM providers could have models say "Sorry", there is little incentive to do so, and it would expose the reality that they are not very accurate, nor can they be properly measured. That said, there clearly are use cases where, if the LLM can't reach a certain level of confidence, it should defer to the user rather than guess.

  • This is actively being worked on by pretty much every major provider. It was the subject of that recent OpenAI paper on hallucinations. The problem is mostly caused by benchmarks that reward correct answers but don't penalize bad answers any more than simply not answering.

    E.g.

    Most current benchmarks have a scoring scheme of:

    1 - Correct answer
    0 - No answer or incorrect answer

    But what they need is something more like:

    1 - Correct answer
    0.25 - No answer
    0 - Incorrect answer

    You need benchmarks (particularly those used in training) to incentivize the models to acknowledge when they're uncertain; a rough sketch of such a scorer is below.
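
    A minimal sketch in Python of what that scoring rule could look like (the function name and the "i don't know" abstention convention are my own assumptions, not anything from the paper):

      def score(answer, reference):
          """Score one benchmark item: 1 for correct, 0.25 for abstaining, 0 for wrong."""
          # Treat an empty or "I don't know" response as an explicit abstention.
          if answer is None or answer.strip().lower() in {"", "i don't know"}:
              return 0.25  # partial credit for admitting uncertainty
          # Exact-match grading, as a stand-in for whatever grading the benchmark uses.
          return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

      # Example: a model that guesses wrong on an item it can't read scores 0,
      # while one that abstains scores 0.25, so guessing is no longer the best policy.
      print(score("42", "17"))           # 0.0
      print(score("I don't know", "17")) # 0.25
      print(score("17", "17"))           # 1.0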