Comment by ilaksh
5 hours ago
The only one it doesn't win is SWE-bench, where it is significantly behind Claude Sonnet. You just can't take down Sonnet.
5 hours ago
One percentage point is not significant, in either the colloquial or the scientific sense[1].
[1] The binomial formula gives a 95% confidence interval of about ±3.7 percentage points, using p=0.77 and N=500.
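For anyone who wants to check the footnote's arithmetic, here is a minimal sketch. It assumes "binomial formula" means the usual normal-approximation (Wald) interval for a proportion, with z = 1.96 for 95% confidence. The function name `binomial_margin` is made up for illustration.

```python
import math

def binomial_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    Assumption: the footnote's "binomial formula" refers to this interval.
    p is the observed pass rate, n the number of benchmark tasks,
    and z the normal quantile (1.96 for 95% confidence).
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# Values taken from the footnote: p=0.77, N=500, 95% confidence.
margin = binomial_margin(0.77, 500)
print(f"+/- {margin * 100:.1f} percentage points")  # prints "+/- 3.7 percentage points"
```

With 500 tasks, a single score near 77% carries roughly ±3.7 points of sampling noise, so a one-point gap sits well inside it.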
Codex has been much better than Sonnet for me.
On what types of tasks?