Comment by faxmeyourcode
4 hours ago
I had a hunch that Opus 4.7 hedged more than the other models, and it turns out it's true:
model               total_claims  hedged_count  hedged_pct
claude-opus-4-7             1000           451        45.1
sonar-pro                   1000           391        39.1
gpt-5.4                     1000           277        27.7
gemini-3-retrieval          1000           129        12.9
gemini-3-pro                1000            60         6.0
Datasette query here:
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
This is in line with my own observations and tests. It is also supported by the distribution of verdicts across the four buckets: Gemini uses the middle buckets (Mostly True and Misleading) far less often than the other models, 6% combined for Gemini without search, while Opus uses them the most, 45% combined. It looks like Gemini is calibrated to be confident and Opus to be careful.
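For anyone who wants to reproduce the hedged percentage, here is a minimal sketch of how it could be computed. This is not the actual Datasette query (the link above is truncated). It assumes a hypothetical `verdicts` table with one row per (model, claim) and treats a verdict as "hedged" when it falls in one of the two middle buckets. The table name, the column names, the `hedged_stats` helper, the outer bucket labels ("True" and "False"), and the toy rows are all made up for illustration.

```python
import sqlite3

# Assumption: "hedged" means the verdict is one of the two middle buckets.
HEDGED = ("Mostly True", "Misleading")


def hedged_stats(rows):
    """Return (model, total_claims, hedged_count, hedged_pct) per model.

    rows is an iterable of (model, verdict) pairs; the schema is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE verdicts (model TEXT, verdict TEXT)")
    conn.executemany("INSERT INTO verdicts VALUES (?, ?)", rows)
    query = """
        SELECT model,
               COUNT(*) AS total_claims,
               SUM(verdict IN (?, ?)) AS hedged_count,
               ROUND(100.0 * SUM(verdict IN (?, ?)) / COUNT(*), 1) AS hedged_pct
        FROM verdicts
        GROUP BY model
        ORDER BY hedged_pct DESC
    """
    result = conn.execute(query, HEDGED + HEDGED).fetchall()
    conn.close()
    return result


# Toy rows for illustration only, not the real dataset.
toy_rows = [
    ("model-a", "True"), ("model-a", "Mostly True"),
    ("model-a", "Misleading"), ("model-a", "False"),
    ("model-b", "True"), ("model-b", "True"),
    ("model-b", "Misleading"), ("model-b", "False"),
]

for row in hedged_stats(toy_rows):
    print(row)
```

The `IN` test evaluates to 0 or 1 in SQLite, so summing it counts the hedged verdicts per model without a separate subquery.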