Comment by faxmeyourcode

2 hours ago

I had a hunch that Opus 4.7 hedges more than other models, and it turns out that's true:

    model                 total_claims  hedged_count  hedged_pct
    claude-opus-4-7       1000          451           45.1
    sonar-pro             1000          391           39.1
    gpt-5.4               1000          277           27.7
    gemini-3-retrieval    1000          129           12.9
    gemini-3-pro          1000          60            6.0

Datasette query here:

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

kostaj 2 hours ago

This is in line with my observations and tests as well. It's also supported by the distribution of verdicts across the four buckets: Gemini uses the middle buckets (Mostly True and Misleading) much less often, 6% combined for Gemini without search, while Opus uses them the most, at 45% combined. It looks like Gemini is calibrated to be confident and Opus to be careful.
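For anyone who wants to reproduce the hedged percentage from per-claim rows, here is a rough sketch. It assumes "hedged" means a verdict in one of the two middle buckets, which is my reading of the numbers rather than something confirmed by the dataset. The `hedged_stats` helper, the outer bucket names ("True", "False"), and the toy rows are all made up for illustration; the toy rows are only built to reproduce two rows of the table above.

```python
from collections import Counter

# Assumption: a claim counts as "hedged" when its verdict is a middle bucket.
HEDGED_VERDICTS = {"Mostly True", "Misleading"}


def hedged_stats(rows):
    """Take (model, verdict) pairs; return {model: (total, hedged, pct)}."""
    totals = Counter()
    hedged = Counter()
    for model, verdict in rows:
        totals[model] += 1
        if verdict in HEDGED_VERDICTS:
            hedged[model] += 1
    return {
        model: (totals[model], hedged[model],
                round(100.0 * hedged[model] / totals[model], 1))
        for model in totals
    }


# Synthetic rows, NOT the real claims: counts chosen to match the table.
# "True" and "False" are placeholder names for the outer buckets.
rows = (
    [("claude-opus-4-7", "Mostly True")] * 451
    + [("claude-opus-4-7", "True")] * 549
    + [("gemini-3-pro", "Misleading")] * 60
    + [("gemini-3-pro", "False")] * 940
)

stats = hedged_stats(rows)
for model, (total, count, pct) in stats.items():
    print(f"{model:<20} {total:<6} {count:<6} {pct}")
```

If the real dataset defines hedging differently, only `HEDGED_VERDICTS` should need to change.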