Comment by Bjorkbat

1 year ago

I think that's a premature conclusion to take from this benchmark.

Something to keep in mind is that Expensify is kind of an anomaly in that it hires freelancers by creating a well-articulated GitHub issue and telling them to go solve that. This is about as ideal as you can hope to ask for when it comes to articulating requirements, and yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.

Not to mention, these models perform a lot worse than their SWE-bench results would otherwise suggest.

Big picture, there's a funny trend when it comes to generative AI of inflated expectations that rapidly deflate once we use them in the real world. I still remember being a little bit freaked out by o1 when it came out because it scored so well on a number of benchmarks. Turns out, it's worse than Claude Sonnet when it comes to coding. Our expectations are consistently inflated by hype and benchmarks, but then once we use them in the real world we find out that they're not as great as the benchmarks would otherwise suggest.

Kind of feels like this is going to go on forever. A new model is announced and teased with crazy benchmark results, then once people get their hands on it they're slightly underwhelmed by how it performs in the real world.

6 comments

throwaway0123_5 1 year ago

> yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.

48.5% with pass@7 though, and presumably o3 would do better... they don't report the inference costs but I'd be shocked if they weren't substantially less than the payouts. I think it is pretty clear that there is real economic value here, and it does make me nervous for the future of the profession, more so than any prior benchmark.

I agree it isn't perfect. It only tests TS/JS and the vast majority of the tasks are front-end. Still, none of the mainstream software engineering benchmarks test anything but JS/Python/sometimes Java.

> Turns out, it's worse than Claude Sonnet when it comes to coding.

This was an interesting takeaway for me too. At first I thought that it suggested reasoning models mostly only help with small-scale, well-defined reasoning tasks, but they report o1's pass@1 going from 9.3% at low reasoning effort to 16.5% with high reasoning effort, so I don't think that can be the case.

Bjorkbat 1 year ago

Yeah, I saw the pass@7 figure as well, and I'm not sure what to make of it. On the one hand, solving nearly half of all tasks is impressive. On the other hand, a machine that might do something correctly if you give it 7 attempts isn't particularly enjoyable to use.
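
For context on the gap between the 16.5% pass@1 and 48.5% pass@7 figures discussed here, a minimal Python sketch follows. `pass_at_k` is the widely used unbiased per-task estimator, 1 - C(n-c, k) / C(n, k); whether the benchmark computed its pass@7 this way is not stated in this thread, so treat that as an assumption. `naive_pass_at_k` is a hypothetical helper that assumes every attempt on every task succeeds independently with the same probability, which is almost certainly too simple.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts sampled, c of them correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def naive_pass_at_k(p: float, k: int) -> float:
    """pass@k if each attempt succeeded independently with probability p (an assumption)."""
    return 1.0 - (1.0 - p) ** k


# 16.5% is the pass@1 figure quoted in the thread for o1 at high reasoning effort.
naive = naive_pass_at_k(0.165, 7)
print(f"naive pass@7 from 16.5% pass@1: {naive:.1%}")
```

Under the naive independence assumption, 16.5% per attempt would give roughly 71.7% at 7 attempts, well above the reported 48.5%. One plausible reading is that failures are concentrated in a subset of tasks that stay unsolved however many attempts are made, though nothing in this thread settles that.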

moralestapia 1 year ago

That's why I wrote "the writing is on the wall".

It will happen, it's just a matter of time, a couple years perhaps.

ianbutler 1 year ago

> 3.5 Sonnet | Yes | IC SWE (Diamond) | N/A | 26.2% | $58k / $236k | 24.5%

But sonnet solved over 25% of them and made 60 grand.

That's a substantial amount of work. I don't entirely disagree with you about it being premature but these things are clearly providing substantial value.
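
As a rough check on those figures, here is a short Python sketch that uses only the rounded numbers quoted in this comment, so the result can differ from the reported 24.5% in the last digit. The variable names are illustrative, and it assumes the 26.2% is the fraction of tasks solved in the same run that earned the $58k.

```python
# Rounded figures quoted in the comment (3.5 Sonnet, IC SWE Diamond).
pass_at_1 = 0.262      # fraction of tasks solved
earned_usd = 58_000    # payout value attached to the solved tasks
total_usd = 236_000    # payout value of all tasks in the set

# Earn rate: share of the total payout attached to solved tasks.
earn_rate = earned_usd / total_usd

# Average payout of a solved task relative to the average payout overall.
# A value below 1.0 means the solved tasks were, on average, the cheaper ones.
relative_payout = earn_rate / pass_at_1

print(f"earn rate: {earn_rate:.1%}")
print(f"solved-task payout vs. average: {relative_payout:.2f}x")
```

With these rounded inputs the earn rate comes out slightly below the solve rate, which suggests the solved tasks paid a little less than the average task.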

Bjorkbat 1 year ago

> But sonnet solved over 25% of them and made 60 grand.

Technically it didn't, since all these tasks were done some time ago. On that note, I feel like putting a dollar amount on the tasks it was able to complete is misleading.

In the real world, if a model masquerading as a human is only right 25% of the time, its reviews on Upwork would reflect that and it would never be able to find work again. It might make a couple thousand before it loses trust.

Of course things would be different if they were open and upfront about this being an LLM, in which case it would presumably never run out of trust.

And again, Expensify is an anomaly among companies in that it gives freelancers well-articulated tasks to work on. The real world is much messier.

- ianbutler 1 year ago
  
  That's a lot of qualifying you have to do to discount this, which is fine, but my take is that you do so at your own peril as we look to the future of this tech.
  The real world is messy, but the real world also adapts to the most cost-effective solution even if it's just alright.
  People will spend more time specifying their task for an LLM-based tool if it gets the job done and costs a fraction of a freelancer.