Comment by scrollop
7 hours ago
I'd recommend looking carefully at a few benchmarks (even though relying on benchmarks is generally problematic):
https://artificialanalysis.ai/evaluations/omniscience
Especially check the hallucination rate for DeepSeek - it's not good.
> Especially check the hallucination rate for DeepSeek - it's not good.
For strongly typed coding tasks, and I imagine other tasks that have cheap validity checks, agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation will fail and the error is fed back into the model so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost-effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns.
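To make the loop concrete, here is a rough sketch of the idea in Python. It is not any particular harness's implementation: `check`, `repair_loop`, and `make_fake_model` are names I made up, and the "model" is a canned stand-in that hallucinates `math.square_root` on its first turn. The validity check here just compiles and runs the candidate, which stands in for a real compiler or type checker.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical signature for a model call: (task, last_error) -> candidate source.
Generate = Callable[[str, Optional[str]], str]


def check(source: str) -> Optional[str]:
    """Cheap validity check: compile and run the candidate.

    Returns None on success, or the error text to feed back to the model.
    """
    try:
        exec(compile(source, "<candidate>", "exec"), {})
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def repair_loop(generate: Generate, task: str, max_turns: int = 5) -> Tuple[str, int]:
    """Ask for a candidate, check it, and feed any error back until it passes."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        candidate = generate(task, feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate, turn
    raise RuntimeError(f"no valid candidate after {max_turns} turns")


def make_fake_model() -> Generate:
    """Stand-in for a real model (assumption): one hallucinated API, then a fix."""
    attempts = iter([
        "import math\nassert math.square_root(9) == 3\n",  # hallucinated function
        "import math\nassert math.sqrt(9) == 3\n",         # corrected on retry
    ])

    def generate(task: str, feedback: Optional[str]) -> str:
        # A real model would read `feedback`; the fake just plays its script.
        return next(attempts)

    return generate


if __name__ == "__main__":
    source, turns = repair_loop(make_fake_model(), "compute the square root of 9")
    print(f"accepted after {turns} turn(s):\n{source}")
```

The point of the sketch is only that the hallucination costs one extra turn rather than a wrong answer, as long as the check is cheap and reliable. Running untrusted model output with `exec` is obviously not something to do outside a sandbox.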