Comment by camgunz

15 hours ago

> I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.

No. GPT-5 has a 40% hallucination rate [0] on SimpleQA [1] without web searching. The SimpleQA questions meet your criteria of "2-3 pages of text content". Unless GPT-5.4 + web searching erases that (I bet it doesn't!), these are bullshit machines.

[0]: https://arxiv.org/pdf/2601.03267

[1]: https://github.com/openai/simple-evals

Specifically in the case where it can use tools - no, it doesn't hallucinate. Which is why you are struggling to find counterexamples.

  • > Specifically in the case where it can use tools - no it doesn't hallucinate.

    OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:

    - 0.7% in LongFact-Concepts

    - 0.8% in LongFact-Objects

    - 1.0% in FActScore

    > Which is why you are struggling to find counterexamples.

    Hey look, over 500 counterexamples: [1].

    GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!

    At some point you gotta face the music, right?

    [0]: https://artificialanalysis.ai/evaluations/omniscience?model-...

    [1]: https://huggingface.co/datasets/ArtificialAnalysis/AA-Omnisc...

    • You had to go all the way and find it in the benchmark results that specifically stress test this.

You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools, when I specifically said that it should be able to use tools. I'm not sure why you present this as though it is a big gotcha.

      I think my main point pretty much stands.
