Comment by ipunchghosts

4 hours ago

I think people only care about how Claude or Codex does.

4 comments

ipunchghosts

GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). I.e., in at least 35% of the claims, at least one of the two models is wrong under this 4-bucket rubric.
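
For anyone wanting to sanity-check an interval like that, here is a rough sketch using the normal-approximation (Wald) interval for a proportion. The sample size `n_claims = 1000` is purely a hypothetical placeholder, since the number of claims scored isn't stated here, and the Wald formula is just one common choice, not necessarily the method the authors used.

```python
import math


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical sample size: the thread does not say how many claims were scored.
n_claims = 1000
agreement = 0.65  # observed agreement rate quoted above

low, high = wald_interval(agreement, n_claims)
print(f"agreement: {agreement:.0%}, 95% CI: {low:.1%} to {high:.1%}")

# Disagreement is the complement, so its interval is the agreement interval
# reflected around 1 (the bounds swap places).
print(f"disagreement: {1 - agreement:.0%}, 95% CI: {1 - high:.1%} to {1 - low:.1%}")
```

Under that assumed sample size the interval comes out at roughly 62% to 68%, which is at least consistent with the quoted figure. A smaller sample would widen it.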

TaupeRanger 3 hours ago

But that's without internet search. Everyone I know uses the models that search when they need to, and I'm sure GPT and Opus would agree on almost everything if 1) they searched when necessary, and 2) they were allowed to give context to their answers instead of being hamstrung to produce specious "research" results.

spprashant 4 hours ago

Looks like they land right around the average of 67% disagreement.

airstrike 4 hours ago

I agree, but the market is pricing way beyond that.