Comment by martinald

7 hours ago

I thought that, but it does do a lot better on other benchmarks.

Perhaps SWE bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter are anything to go by, I suspect this will be a huge boon for developers. SWE bench is really testing bugfixing/feature dev more.

Anyway let's see. I'm still hyped!

It seems the benchmarks that had a big jump had to do with visual capabilities. I wonder how that will translate to improvements to the workloads LLMs are currently used for (or maybe it will introduce new workloads).

SWE Bench doesn't even test bugfixing / feature dev properly once you reach roughly 70%, if you don't benchmaxx it.

That would be great! But AI is a bubble if these models can’t do serious engineering work.