Comment by scrollop

7 hours ago

I'd recommend looking carefully at a few benchmarks (even though relying on benchmarks is generally problematic):

https://artificialanalysis.ai/evaluations/omniscience

Especially check the hallucination rate for DeepSeek - it's not good.

1 comment

scrollop

> Especially check the hallucination rate for DeepSeek - it's not good.

For strongly typed coding tasks - and, I imagine, other tasks that have cheap validity checks - agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation fails and the error is fed back into the model so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost-effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns.
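
To make the loop concrete, here is a minimal sketch of the feedback cycle I mean. Everything in it is an assumption for illustration: `fake_model` is a hypothetical stand-in for a real LLM call, and `check` simply executes the candidate in Python as a stand-in for a compiler or type checker. Real harnesses are more involved, but the shape is the same.

```python
import math
from typing import Callable, Optional, Tuple


def check(source: str) -> Optional[str]:
    """Cheap validity check: return an error message, or None if the candidate passes.

    Assumption: executing the snippet stands in for a compiler or type checker.
    """
    try:
        exec(compile(source, "<candidate>", "exec"), {"math": math})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(
    generate: Callable[[Optional[str]], str], max_turns: int = 5
) -> Tuple[str, int]:
    """Ask for a candidate, check it, and feed any error back until it passes."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        candidate = generate(feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate, turn
    raise RuntimeError(f"no valid candidate after {max_turns} turns: {feedback}")


def fake_model(feedback: Optional[str]) -> str:
    """Hypothetical model: hallucinates an API first, corrects itself once it sees an error."""
    if feedback is None:
        return "result = math.square_root(16)"  # math.square_root does not exist
    return "result = math.sqrt(16)"


solution, turns = repair_loop(fake_model)
```

The hallucinated `math.square_root` never reaches the caller: the check rejects it, the error goes back in, and the second turn lands on `math.sqrt`. The cost is the extra turn, which is the time-for-correctness trade described above.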