Comment by throwaway0123_5

5 months ago

> yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.

48.5% with pass@7 though, and presumably o3 would do better... they don't report the inference costs, but I'd be shocked if they weren't substantially less than the payouts. I think it's pretty clear there is real economic value here, and it does make me nervous for the future of the profession, more so than any prior benchmark.

I agree it isn't perfect: it only tests TS/JS, and the vast majority of the tasks are front-end. Still, none of the mainstream software engineering benchmarks test anything beyond JS, Python, and sometimes Java.

> Turns out, it's worse than Claude Sonnet when it comes to coding.

This was an interesting takeaway for me too. At first I thought it suggested reasoning models only help with small-scale, well-defined reasoning tasks, but they report o1's pass@1 going from 9.3% at low reasoning effort to 16.5% at high reasoning effort, so I don't think that can be the case.

Yeah, I saw the pass@7 figure as well, and I'm not sure what to make of it. On the one hand, solving nearly half of all tasks is impressive. On the other hand, a machine that might do something correctly if you give it 7 attempts isn't particularly enjoyable to use.
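For a sense of how much of that pass@7 number is just extra dice rolls, here's a back-of-the-envelope sketch. The only source figures are the 16.5% pass@1 and 48.5% pass@7 quoted above; the independence assumption is mine, and it's clearly unrealistic:

```python
# Naive expectation: treat each of the 7 attempts as an independent
# draw succeeding with the reported pass@1 rate (16.5% for o1 at
# high reasoning effort).
pass_at_1 = 0.165
expected_pass_at_7 = 1 - (1 - pass_at_1) ** 7
print(f"expected pass@7 under independence: {expected_pass_at_7:.1%}")
# -> ~71.7%, well above the reported 48.5%
```

The reported 48.5% falling well short of ~71.7% suggests attempts on the same task are far from independent: the tasks the model fails, it tends to fail on every attempt, which fits the "might do it if you give it 7 attempts" frustration.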