
Comment by simianwords

2 hours ago

The latest top-reported agentic LLMs score about 83–87%, versus an original human baseline of about 25.3% end to end, so today's best systems appear to outperform humans by roughly 58–62 percentage points, or about 3.3–3.4×.

So according to your own benchmark, LLMs hallucinate much less than humans and achieve far higher accuracy.

Do you agree that we should be more skeptical of humans than of LLMs on these tasks?

1. Irrelevant. I've delivered example after example of your fave model bullshitting. You should've bitten the bullet long ago. Honestly, I'm disappointed; I've seen you in a lot of AI threads and assumed you'd be good to talk to on this, but you've moved the goalposts over and over rather than engage in good faith. Anyone reading this thread (god bless them) can see you're plainly not objective here, which calls your advocacy everywhere into question.

2. Humans will say "I don't know." The problem with hallucinations isn't that they're wrong; it's that there's no way to know they're wrong without being an expert or doing everything yourself, which undermines much of the reason for using an LLM in the first place (it certainly undermines their companies' valuations). You're conflating human failure ("I don't know") with model bullshitting ("I do know"... but it's wrong). I would've previously attributed that to basic human fuzziness, but now that I know you're not objective, I'm pretty sure it's just flailing debate tactics.

3. Users can't teach these services to be better. If I have a junior engineer making assumptions about an API, I can teach them not to do that, or fire them in favor of one who can learn. I can't do that with LLMs.

4. The humans they're testing against aren't experts. Tax-law experts will beat LLMs at tax law, and so on. Again, another flailing debate tactic.

Predictably, I'm done with this thread. Feel free to reply if you want the last word.