Comment by moralestapia
5 months ago
The writing is very clearly on the wall.
On a non-pessimistic note, I don't think the SWE role will disappear, but what's the best thing one could do to prepare for this?
I think that's a premature conclusion to draw from this benchmark.
Something to keep in mind is that Expensify is something of an anomaly: it hires freelancers by writing a well-articulated GitHub issue and telling them to go solve it. That's about as ideal a way of articulating requirements as you could hope for, and yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.
Not to mention, these models perform a lot worse than their SWE-bench results would otherwise suggest.
Big picture, there's a funny trend with generative AI: inflated expectations that rapidly deflate once we use the models in the real world. I still remember being a little freaked out by o1 when it came out because it scored so well on a number of benchmarks. It turns out it's worse than Claude Sonnet at coding. Our expectations are consistently inflated by hype and benchmark results, and then real-world use reveals the models aren't as great as the benchmarks suggest.
Kind of feels like this is going to go on forever: a new model is announced, teased with crazy benchmark results, and once people get their hands on it they're slightly underwhelmed by how it performs in the real world.
> yet o1 with high reasoning could only solve 16.5% of tasks formatted this way.
48.5% with pass@7, though, and presumably o3 would do better. They don't report inference costs, but I'd be shocked if they weren't substantially less than the payouts. I think it's pretty clear there's real economic value here, and it makes me nervous for the future of the profession, more so than any prior benchmark.
I agree it isn't perfect. It only tests TS/JS and the vast majority of the tasks are front-end; still, none of the mainstream software engineering benchmarks test anything but JS/Python/sometimes Java.
> Turns out, it's worse than Claude Sonnet when it comes to coding.
This was an interesting takeaway for me too. At first I thought that it suggested reasoning models mostly only help with small-scale, well-defined reasoning tasks, but they report o1's pass@1 going from 9.3% at low reasoning effort to 16.5% with high reasoning effort, so I don't think that can be the case.
Yeah, I saw the pass@7 figure as well, and I'm not sure what to make of it. On the one hand, solving nearly half of all tasks is impressive. On the other hand, a machine that might do something correctly if you give it 7 attempts isn't particularly enjoyable to use.
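One way to read the pass@1 vs. pass@7 gap: if the 7 attempts were statistically independent at the pass@1 rate, pass@7 would be much higher than the reported 48.5%. A quick back-of-the-envelope check, using only the figures quoted in this thread:

```python
# If each attempt succeeded independently with the pass@1 probability,
# pass@7 would be 1 - (probability that all 7 attempts fail).
pass_at_1 = 0.165  # o1, high reasoning effort
pass_at_7_if_independent = 1 - (1 - pass_at_1) ** 7
print(f"{pass_at_7_if_independent:.1%}")  # ~71.7%, vs. the reported 48.5%
```

Since the reported 48.5% is well below the independence estimate, failures are correlated: tasks the model misses once, it tends to keep missing, which is consistent with "7 attempts" being less useful than it sounds.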
That's why I wrote "the writing is on the wall".
It will happen; it's just a matter of time, a couple of years perhaps.
3.5 Sonnet | Yes | IC SWE (Diamond) | N/A | 26.2% | $58k / $236k | 24.5%
But Sonnet solved over 25% of them and made 60 grand.
That's a substantial amount of work. I don't entirely disagree with you that it's premature, but these things are clearly providing substantial value.
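As a sanity check on the "made 60 grand" framing, the earn rate in that table row is just dollars earned over dollars available. The $58k and $236k are rounded figures, which is presumably why this lands a hair off the reported 24.5%:

```python
# Earn rate = payout earned on solved tasks / total payout available,
# using the rounded dollar figures from the table row above.
earned = 58_000
available = 236_000
earn_rate = earned / available
print(f"{earn_rate:.1%}")  # 24.6% with the rounded inputs
```

The earn rate (~24.5%) sitting slightly below the solve rate (26.2%) suggests the tasks it solved paid a bit less than average.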
> But Sonnet solved over 25% of them and made 60 grand.
Technically it didn't, since all these tasks were completed some time ago. On that note, I feel like putting a dollar amount on the tasks it was able to complete is misleading.
In the real world, if a model masquerading as a human is only right 25% of the time, its reviews on Upwork would reflect that and it would never be able to find work again. It might make a couple thousand dollars before it loses trust.
Of course things would be different if they were open and upfront about this being an LLM, in which case it would presumably never run out of trust.
And again, Expensify is an anomaly among companies in that it gives freelancers well-articulated tasks to work on. The real world is much messier.
How did you draw that conclusion from reading the contents of the link? This is a benchmark.
> We evaluate model performance and find that frontier models are still unable to solve the majority of tasks.
If the writing is on the wall, shouldn't we be seeing a massive boost in open source contributions? Shouldn't we be seeing a spike in new kernels, operating systems, network stacks, databases, programming languages, frameworks, libraries...?
It could also be argued that contributions will go way down. People who think AI can one-shot many tasks will have less need for reuse and open-source software. In fact, if they aren't SWEs by trade (and are just using AI directly), they may never have been exposed to open source culture at all. If it works, who cares how?
There are opposing theories: that with AI we will see fewer open source contributions, less new tech (outside AI), fewer libraries, etc. There's also less incentive to post code these days; in the age of AI, many no longer want to make their code public.
People will always contribute to open source. If AI agents are so good, why aren't people building open source projects around agents? The combined computing power of many agents would be greater than that of a sole agent. As of right now, we're not really seeing anything of the sort.
Yeah, why help add support for my device when I can just type "LLM create driver Logitech mouse & install" :P
1. o1 was only released to the public 2 months ago. o3 was only released to the public (in an unusual form that's less directly usable for this) 2 weeks ago.
The subset of people who might do that and are paying sufficient attention to this are still reeling, and are mostly otherwise occupied.
2. A lily pad is growing in a pond and it doubles in size every day. After 30 days it covers the entire pond. On what day does it cover half the pond? https://i.imgur.com/grNJAZO.jpeg
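For anyone who'd rather check the riddle mechanically than in their head, here's a trivial sketch of the doubling:

```python
# The pond is fully covered on day 30; the pad doubles each day,
# so stepping backwards one day halves the coverage.
coverage = 1.0  # fraction of the pond covered on day 30
day = 30
while coverage > 0.5:
    coverage /= 2
    day -= 1
print(day)  # 29: one doubling before "full" is "half"
```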
> people who might do that and are paying sufficient attention to this are still reeling
What are they doing?
o3 hasn't been released yet, just o3-mini
There will always be "real thinking" roles in software, but the sheer pressure on salaries from a vastly increased free labour pool will lead to an outcome a bit like embedded software development, where rates don't really match the skill level. I think the most obvious strategy for the time being is figuring out how to become a buyer of the services you understand rather than a badly crowded-out seller.
If the AI is really that good, we could use it to develop replacements for all the existing commercial software (e.g. Windows, Oracle, SAP, Adobe, etc.) and put those companies out of business as payback.
If the AI is really that good, it could also replace the people using all the existing commercial software. And the people managing them.
No, the next goal is to build AI models to replace sales, marketing, middle managers, VPs and CEOs. Then we will have a complete stack called 'Corporate AI (tm)'
If the AI is really that good, it could replace the people managing the software to create the AI.
The software itself is possible to replace; the deep interconnection to that software isn't.
Could be; it certainly shows their intent and focus. OpenAI seems to be targeting the SWE profession first and foremost, at least to an outside observer. Time will tell whether it succeeds, but you can clearly see what they're targeting (vs. other potential domains).
and that writing says "we need to find investor money before the FOMO is over"
Did you read the paper? The conclusions don't suggest that.
I think this part of the conclusion is pretty foreboding for the whole profession. It seems like there is a lot of cognitive dissonance on interpreting what the future holds for engineers in the software industry.
“However, they could also shift labor demand, especially in the short term for entry-level and freelance software engineers, and have broader long-term implications for the software industry.”