Comment by camgunz
7 hours ago
GPT-5.4 gets 82.7% on BrowseComp (a benchmark specifically testing tool use), which is a hallucination rate of 17.3%, on questions like "Give me the title of the scientific paper published in the EMNLP conference between 2018-2023 where the first author did their undergrad at Dartmouth College and the fourth author did their undergrad at University of Pennsylvania."
Since the goalposts have been moved to include effort, I'm compelled to say I found this while waiting in line at Starbucks, 5 minutes tops. GPT-5.4 probably could have found it too, though it lies more than 1/6 of the time, so one could be forgiven for not wanting to risk it.
The latest top reported agentic LLMs score about 83–87%, versus an original human baseline of about 25.3% end to end, so today's best systems appear to outperform humans by roughly 58–62 percentage points, or about 3.3–3.4×.
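For concreteness, a minimal sketch of the arithmetic behind the figures quoted in this thread (the 82.7% BrowseComp score and the 83–87% vs. 25.3% range are taken at face value here, not independently verified):

```python
def error_rate(accuracy_pct: float) -> float:
    """Fraction of questions answered incorrectly, given accuracy in percent."""
    return 100.0 - accuracy_pct

def gap_vs_baseline(model_pct: float, human_pct: float) -> tuple[float, float]:
    """Return (percentage-point gap, ratio) of a model score over a human baseline."""
    return model_pct - human_pct, model_pct / human_pct

print(error_rate(82.7))             # 17.3 -> the "wrong more than 1 in 6" figure
print(gap_vs_baseline(83.0, 25.3))  # (57.7, ~3.28)
print(gap_vs_baseline(87.0, 25.3))  # (61.7, ~3.44)
```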
So according to your own benchmark, LLMs hallucinate much less than humans and report way higher accuracy.
Do you agree, then, that you should be more skeptical of humans than of LLMs on these tasks?
1. Irrelevant. I've delivered example after example of your fave model bullshitting. You should've bitten the bullet long ago. Honestly I'm disappointed; I've seen you in a lot of AI threads and assumed you'd be good to talk to on this, but you've moved the goalposts over and over again rather than engage in good faith. Anyone reading this thread (god bless them) can see you're plainly not objective here, thus calling into question your advocacy everywhere.
2. Humans will say "I don't know". The problem with hallucinations isn't that they're wrong, it's that there's no way to know they're wrong without being an expert or doing everything yourself, which undermines much of the reason for using an LLM--it certainly undermines their companies' valuations. You're conflating human failure ("I don't know") with model bullshitting ("I do know"... but it's wrong), which I would've previously attributed to basic human fuzziness, but now that I know you're not objective I'm pretty sure it's just flailing debate tactics.
3. Users can't teach these services to be better. If I have a junior engineer making assumptions about an API, I can teach them to not do that, or fire them in favor of one that can. I can't do that with LLMs.
4. The humans they're testing against aren't experts. Tax law experts will beat LLMs at tax law, etc. Again another flailing debate tactic.
Predictably, I'm done with this thread. Feel free to reply if you want the last word.
This was my original point:
>I don't think calling AI a bullshit machine is correct. In spirit.
That was always my goalpost, and I posed the challenge of getting it to bullshit to drive that point across. You yourself said it would be trivial.
1. You came up with the horns question - I tried it with the thinking model and it clearly understood that it was a joke and replied appropriately
2. You came up with the assembly question - I tried it with the thinking model and it gave the right answer again
3. Then you gave up trying to write prompts yourself because you realised it's in fact not trivial
4. Then you started looking for benchmarks to show that it bullshits
5. You picked a benchmark that doesn't allow tools (which was not my constraint)
6. Then you picked a benchmark that does allow tools, and it turns out the model performs much better than humans
7. Upon hearing this, you shifted the goalposts to "models don't know how to say I don't know" and "you can't teach models", etc.
On the last part: there's a benchmark called SimpleQA which doesn't allow tools but does allow "I don't know" as an answer, and GPT-5 still beats humans on it.
In light of all this, I think you should reconsider and accept that calling AI a bullshit machine isn't correct.