
Comment by camgunz

15 hours ago

I've got about 20 minutes in this; mostly I've been reading wallstreetbets at the Shake Shack bar in the Boston airport. I'm happy to post this over and over again until you engage w/ it:

> I found over 500 examples that fit your criteria.

They don't use tools. This is like the 4th time you've ignored that on purpose. Tool use was not part of the challenge.

  • GPT-5.4 gets 82.7% on BrowseComp (a benchmark specifically testing tool use), which is a hallucination rate of 17.3%, on questions like "Give me the title of the scientific paper published in the EMNLP conference between 2018-2023 where the first author did their undergrad at Dartmouth College and the fourth author did their undergrad at University of Pennsylvania."

    Since the goalposts have been moved to include effort, I'm compelled to say I found this while waiting in line at Starbucks, 5 mins tops. Probably GPT-5.4 could have found this too, though it lies > 1/6 the time, so one could be forgiven for not wanting to risk it.

    https://llm-stats.com/benchmarks/browsecomp

    https://openai.com/index/browsecomp/

    • The latest top reported agentic LLMs score about 83–87%, versus the original human baseline of about 25.3% end to end, so today's best systems appear to outperform humans by roughly 58–62 percentage points, or about 3.3–3.4×.

      So according to your own benchmark, LLMs hallucinate much less than humans and report far higher accuracy.

      Do you agree to be more skeptical of humans than LLMs on these tasks?
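      The percentage-point and multiplier figures above can be sanity-checked directly; here's a quick sketch (using the numbers quoted in the bullet, not independently verified):

      ```python
      # Gap between top agentic LLM BrowseComp scores and the human baseline,
      # using the figures quoted above.
      human_baseline = 25.3            # reported end-to-end human accuracy, %
      llm_low, llm_high = 83.0, 87.0   # reported range for top agentic LLMs, %

      gap_low = llm_low - human_baseline     # ≈57.7 percentage points
      gap_high = llm_high - human_baseline   # ≈61.7 percentage points
      ratio_low = llm_low / human_baseline   # ≈3.28x
      ratio_high = llm_high / human_baseline # ≈3.44x

      print(f"gap: {gap_low:.1f}-{gap_high:.1f} pp, "
            f"ratio: {ratio_low:.2f}x-{ratio_high:.2f}x")
      ```

      which is where the "roughly 58–62 percentage points, or about 3.3–3.4×" comes from.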
