
Comment by silvertaza

18 hours ago

Still a huge hallucination rate, unfortunately: 86%. For comparison, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...

Grok is at 17%? And that's the lowest; most models are at 80%+?

While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.

  • > While hallucination is probably closer to 100% depending on the question.

    But the benchmark didn't ask those questions, and it seems Grok is very good at saying it doesn't know the answer otherwise.

  • It makes sense. Grok is taught to answer the question, regardless of how explicit or extreme it is. These other models are taught to suppress any wrongthink, and that's going to make it hard to answer things correctly. If you've been told you can't give the real answer because it's disallowed, then you'll have to make one up.

There's something off with this because Haiku should not be that good.

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response regardless of outcome, because the point is to sell the technology's competency (and the perception thereof), not its capabilities, to a bunch of people who have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.

  • We don't want hallucinations either, I promise you.

    A few biased defenses:

    - I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

    - This eval only measures a binary attempted vs. did-not-attempt signal; it doesn't reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

    - On the flip side, GPT-5.5 has the highest accuracy score.

    - With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

    - On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

    - Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.

    Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.
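The binary attempted-vs-abstained scoring described above can be sketched in a few lines. This is a hypothetical reconstruction, not the benchmark's published methodology: it assumes accuracy is scored over all questions while hallucination rate is scored only over attempted ones, which would explain how a model that abstains often can post a low hallucination rate despite modest accuracy.

```python
# Hypothetical scoring sketch; the formulas are assumptions, not the
# benchmark's documented definitions.
def score(answers):
    """answers: list of "correct" | "wrong" | "abstain"."""
    attempted = [a for a in answers if a != "abstain"]
    correct = answers.count("correct")
    wrong = answers.count("wrong")
    accuracy = correct / len(answers)            # over all questions
    # Hallucination rate: wrong answers among attempted questions only.
    hallucination = wrong / len(attempted) if attempted else 0.0
    return accuracy, hallucination

# Model A attempts all 100 questions and gets half wrong.
a_acc, a_hal = score(["correct"] * 50 + ["wrong"] * 50)
print(a_acc, a_hal)  # 0.5 0.5

# Model B abstains on 60 hard questions, attempts 40, gets 36 right.
b_acc, b_hal = score(["correct"] * 36 + ["wrong"] * 4 + ["abstain"] * 60)
print(b_acc, b_hal)  # 0.36 0.1
```

Under this kind of metric, Model B "hallucinates" far less than Model A while being less accurate overall, which is consistent with both the clarifying-questions point above and the thread's confusion about the 17% figure.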

  • On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases; after 10 rounds of replies I end up having to correct it so much that it comes full circle and starts to agree with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Like are programmers and engineers using LLMs completely differently than I'm doing, because the underlying technology is fundamentally the same.

    • Totally agreed, this has been and will continue to be a problem for all existing models.

      > Like are programmers and engineers using LLMs completely differently than I'm doing

      No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.