Comment by catigula
6 hours ago
I know this is a little controversial, but the lack of performance on SWE-bench is hugely disappointing from an economic standpoint, I think. These models don't have any viable path to profitability if they can't take engineering jobs.
I thought that too, but it does do a lot better on other benchmarks.
Perhaps SWE-bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter are any indication, I suspect this will be a huge boon for developers. SWE-bench is really testing bugfixing/feature dev more.
Anyway let's see. I'm still hyped!
It seems the benchmarks that had a big jump had to do with visual capabilities. I wonder how that will translate to improvements to the workloads LLMs are currently used for (or maybe it will introduce new workloads).
SWE-bench doesn't even test bugfixing/feature dev properly past roughly 70%, if you don't benchmaxx it.
That would be great! But AI is a bubble if these models can’t do serious engineering work.
People here, and in tech in general, are so lost in the sauce.
According to OpenAI at least, who probably produce the most tokens of all the labs (if we don't count Google AI Overviews and other unrequested AI bolt-ons), programming tokens account for ~4% of total generations.
That's nothing. The returns will come from everyone and their grandma paying $30-100/mo to use the services, just like everyone pays for a cell phone and electricity.
Don't be fooled: we are still in the "open hands" start-up phase of LLMs. The "enshittification" will follow.
Really? If they can make an engineer more productive, that's worth a lot. Naive napkin math: 1.5x productivity on one $200k/year engineer is worth an extra $100k/year of output.
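Spelling that out as a minimal sketch (the salary and multiplier are the assumed figures above, not measured data):

```python
# Hypothetical napkin math: value of AI-driven productivity gains for one engineer.
# Both inputs are assumptions from the comment above, not measured figures.
salary = 200_000               # fully loaded cost of one engineer, $/year
productivity_multiplier = 1.5  # assumed output with AI assistance vs. without

extra_output_value = salary * (productivity_multiplier - 1)
print(f"Extra output per engineer: ${extra_output_value:,.0f}/year")  # -> $100,000/year
```

Of course this ignores whether the employer actually captures that extra output, or whether it just lowers prices for software overall.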
People generally don't understand what these models are doing to engineering salaries. The skill level required to produce working software is going way down.