Comment by simianwords

3 hours ago

They don't use tools. This is like the 4th time you've ignored this on purpose. That was not part of the challenge.

1 comment

simianwords

GPT-5.4 gets 82.7% on BrowseComp (a benchmark specifically testing tool use), which is a hallucination rate of 17.3%, on questions like "Give me the title of the scientific paper published in the EMNLP conference between 2018-2023 where the first author did their undergrad at Dartmouth College and the fourth author did their undergrad at University of Pennsylvania."

Since the goalposts have been moved to include effort, I'm compelled to say I found this while waiting in line at Starbucks, 5 mins tops. GPT-5.4 probably could have found this too, though it lies more than 1/6 of the time, so one could be forgiven for not wanting to risk it.

https://llm-stats.com/benchmarks/browsecomp

https://openai.com/index/browsecomp/