
Comment by simianwords

11 hours ago

I agree that they hallucinate sometimes. I agree they bullshit sometimes. But the extent is way overblown. They basically don't bullshit ever under the constraints of

1. 2-3 pages of text context

2. GPT-5.4 thinking

I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.

> I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.

No. GPT-5 has a 40% hallucination rate [0] on SimpleQA [1] without web searching. The SimpleQA questions meet your criterion of "2-3 pages of text context". Unless 5.4 + web searching erases that (I bet it doesn't!), these are bullshit machines.

[0]: https://arxiv.org/pdf/2601.03267

[1]: https://github.com/openai/simple-evals

  • Specifically in the case where it can use tools - no it doesn't hallucinate. Which is why you are struggling to find counterexamples.

    • > Specifically in the case where it can use tools - no it doesn't hallucinate.

      OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:

      - 0.7% in LongFact-Concepts

      - 0.8% in LongFact-Objects

      - 1.0% in FActScore

      > Which is why you are struggling to find counterexamples.

      Hey look, over 500 counterexamples: [1].

      GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!
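      For what it's worth, "hallucination rate" in these evals is usually the share of *attempted* answers that are wrong, with refusals ("I don't know") excluded from the denominator. A minimal sketch of that calculation, using hypothetical grade labels (the real evals use an LLM grader to assign them):

      ```python
      # Sketch of a SimpleQA-style hallucination rate. Grade labels are
      # hypothetical: "correct", "incorrect", or "not_attempted" (a refusal).
      from collections import Counter

      def hallucination_rate(grades: list[str]) -> float:
          """Fraction of attempted answers that are incorrect."""
          counts = Counter(grades)
          attempted = counts["correct"] + counts["incorrect"]
          return counts["incorrect"] / attempted if attempted else 0.0

      grades = ["correct", "incorrect", "not_attempted", "incorrect", "correct"]
      print(hallucination_rate(grades))  # 2 wrong out of 4 attempted -> 0.5
      ```

      This is why a model can post a high hallucination rate even while answering most questions: abstaining lowers the rate, confidently guessing raises it.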

      At some point you gotta face the music, right?

      [0]: https://artificialanalysis.ai/evaluations/omniscience?model-...

      [1]: https://huggingface.co/datasets/ArtificialAnalysis/AA-Omnisc...
