Comment by tedsanders
13 hours ago
We don't want hallucinations either, I promise you.
A few biased defenses:
- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.
- This eval measures only a binary outcome (attempted vs. did not attempt) and doesn't reward any continuous hedging like "I think it's X, but to be honest I'm not sure" (see the sketch after this list).
- On the flip side, GPT-5.5 has the highest accuracy score.
- With any hallucination rate over 1% (whether it's 30% or 70%), you should be verifying anything important anyway.
- On our internal eval, built from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually improved substantially from 5.2 to 5.4 to 5.5. So, as always, progress depends on how you measure it.
- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.
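To make the binary-vs-hedging point concrete, here is a minimal sketch of the two scoring schemes. The field names, weights, and partial-credit values are illustrative assumptions, not the actual eval's rubric:

```python
# Hypothetical sketch of the scoring distinction described above.
# All names and weights are assumptions, not the real eval's grader.
from dataclasses import dataclass

@dataclass
class Response:
    attempted: bool  # did the model commit to an answer?
    correct: bool    # was the committed answer right?
    hedged: bool     # did it flag uncertainty ("I think it's X, but I'm not sure")?

def binary_score(r: Response) -> str:
    """Binary scheme: every attempt is either right or a hallucination;
    hedging language changes nothing."""
    if not r.attempted:
        return "abstained"
    return "correct" if r.correct else "hallucinated"

def continuous_score(r: Response) -> float:
    """One possible hedging-aware scheme (assumed): confident wrong
    answers cost more than hedged wrong answers."""
    if not r.attempted:
        return 0.0
    if r.correct:
        return 0.5 if r.hedged else 1.0   # hedged-but-right earns partial credit
    return -0.25 if r.hedged else -1.0    # hedged-but-wrong is penalized less

# Under binary_score, a hedged wrong guess and a confident wrong guess
# are identical "hallucinations"; continuous_score separates them.
r = Response(attempted=True, correct=False, hedged=True)
print(binary_score(r))      # hallucinated
print(continuous_score(r))  # -0.25
```

Under the binary scheme, a model gains nothing by expressing calibrated uncertainty, which is the gap the second bullet is pointing at.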
Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.