
Comment by silvertaza

18 hours ago

Still a huge hallucination rate, unfortunately: 86%. For comparison, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...

Grok is at 17%? And that's the lowest; most models are at 80%+?

While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.

  • > While hallucination is probably closer to 100% depending on the question.

    But the benchmark didn't ask those questions, and it seems Grok is very good at saying it doesn't know the answer otherwise.

  • It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink, and that's going to make it hard to answer things correctly. If you've been told you can't give the real answer because it's disallowed, then you'll have to make one up.

There's something off with this because Haiku should not be that good.

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response regardless of outcome, because the point is to sell the technology's competency (and the perception thereof), not its capabilities, to a bunch of people who have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.

  • We don't want hallucinations either, I promise you.

    A few biased defenses:

    - I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

    - This eval only measures a binary attempted vs. did-not-attempt signal; it doesn't reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

    - On the flip side, GPT-5.5 has the highest accuracy score.

    - With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

    - On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

    - Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.

    Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.
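The binary attempted-vs-abstained scoring described above can be sketched in a few lines. This is a hypothetical reconstruction, not the benchmark's published methodology: it assumes accuracy is scored over all questions while hallucination rate is scored only over attempted ones, which would explain how a model that abstains often can post a low hallucination rate despite modest accuracy.

```python
# Hypothetical scoring sketch; the formulas are assumptions, not the
# benchmark's documented definitions.
def score(answers):
    """answers: list of "correct" | "wrong" | "abstain"."""
    attempted = [a for a in answers if a != "abstain"]
    correct = answers.count("correct")
    wrong = answers.count("wrong")
    accuracy = correct / len(answers)            # over all questions
    # Hallucination rate: wrong answers among attempted questions only.
    hallucination = wrong / len(attempted) if attempted else 0.0
    return accuracy, hallucination

# Model A attempts all 100 questions and gets half wrong.
a_acc, a_hal = score(["correct"] * 50 + ["wrong"] * 50)
print(a_acc, a_hal)  # 0.5 0.5

# Model B abstains on 60 hard questions, attempts 40, gets 36 right.
b_acc, b_hal = score(["correct"] * 36 + ["wrong"] * 4 + ["abstain"] * 60)
print(b_acc, b_hal)  # 0.36 0.1
```

Under this kind of metric, Model B "hallucinates" far less than Model A while being less accurate overall, which is consistent with both the clarifying-questions point above and the thread's confusion about the 17% figure.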

  • On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases; after 10 rounds of replies I end up having to correct it so much that it comes full circle and starts to agree with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Like are programmers and engineers using LLMs completely differently than I'm doing, because the underlying technology is fundamentally the same.

    • Totally agreed, this has been and will continue to be a problem for all existing models.

      > Like are programmers and engineers using LLMs completely differently than I'm doing

      No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.