
Comment by scrollop

6 hours ago

I'd recommend looking carefully at a few benchmarks (even though relying on benchmarks is generally problematic):

https://artificialanalysis.ai/evaluations/omniscience

Esp check the Hallucination rate for Deepseek - it's not good.

> Esp check the Hallucination rate for Deepseek - it's not good.

For strongly-typed coding tasks - and, I imagine, other tasks that have cheap validity checks - agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation will fail and the error is fed back into the machine so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost-effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns.
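To make the loop concrete, here is a minimal sketch of what I mean, not any particular harness's implementation. Everything in it is a stand-in I made up for illustration: `check` plays the compiler (it only verifies that `math.<name>` references actually exist), `toy_model` plays the LLM (it hallucinates `math.square_root` on the first turn and corrects itself once it sees the error), and `solve` is the feedback loop.

```python
import ast
import math
from typing import Callable, Optional, Tuple


def check(code: str) -> Optional[str]:
    """Cheap validity check standing in for a compiler.

    Returns an error message, or None if the code passes.
    Hypothetical helper: it only catches syntax errors and
    references to attributes that do not exist on `math`.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "math"
            and not hasattr(math, node.attr)
        ):
            return f"module 'math' has no attribute '{node.attr}'"
    return None


def solve(
    model: Callable[[str, Optional[str]], str],
    task: str,
    max_turns: int = 5,
) -> Tuple[str, int]:
    """Ask the model, validate, and feed any error back until it passes."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = model(task, feedback)
        feedback = check(code)
        if feedback is None:
            return code, turn
    raise RuntimeError(f"no valid solution after {max_turns} turns: {feedback}")


def toy_model(task: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM call (assumption: a real harness calls an API here)."""
    if feedback is None:
        # First attempt: hallucinated API.
        return "import math\nresult = math.square_root(16)\n"
    # "Oh, wait..." turn: the error message steers it to the real function.
    return "import math\nresult = math.sqrt(16)\n"


code, turns = solve(toy_model, "compute the square root of 16")
```

The extra turn is the whole cost of the hallucination here. Whether that trade is worth it depends on the numbers: the cheaper model wins whenever its per-attempt price times the number of turns it needs stays below what the stronger model would cost for the same task.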