Comment by lunar_mycroft
5 hours ago
I've seen the code they produce without extensive help from human developers; this is clearly false.
Good to see the classic "yeah the models weren't good enough six months ago, but this time they actually are, promise! Please forget you were hearing the exact same thing six months ago!" is alive and well though.
Are you aware of performance trends though? You’re painting a picture that seems to ignore how things have consistently trended for many years now, even pre ChatGPT. It is absolutely data driven to say “an inflection point has happened within the last 6 months”. And that was also true 6 months ago (when people started using coding agents fairly consistently, since Sonnet 4). And it was true 6 months before that. It’s not like people say “we’ve fixed all the bugs!” and then nothing has changed. I don’t necessarily agree with the parent poster that agents are better than humans, but they are certainly much better at many tasks.
> Are you aware of performance trends though? You’re painting a picture that seems to ignore how things have consistently trended for many years now, even pre ChatGPT.
Models have been getting better, but all that follows from that is that newer models tend to be better than older ones. It doesn't follow that they have (or even will in the future) gotten better than anything else, be that human developers, a given definition of good enough, etc.
> It is absolutely data driven to say “an inflection point has happened within the last 6 months”.
With all due respect to OP (who I think is responsible for popularizing that way of phrasing it), I don't think it is when you consider the actual definition of "inflection point". At best I think you can say that models crossed a lot of developers' definition of good enough around then, which is a different thing. The problem I have with that is that, as a (mostly) outsider looking in, it doesn't seem like they're right.
> Models have been getting better, but all that follows from that is that newer models tend to be better than older ones. It doesn't follow that they have (or even will in the future) gotten better than anything else, be that human developers, a given definition of good enough, etc.
But this is not true. You’re saying we only have relative performance numbers and not absolute measures of capabilities and reliability, but we do have absolute measures. OSS benchmarks, as well as the internal flywheels of these companies, are good complementary measurements.
> At best I think you can say that models crossed a lot of developers' definition of good enough around then, which is a different thing
That’s the inflection point. The implication is a massive jump in adoption. We’re not pulling this out of a hat; there are a number of compelling data points. The onus is on people to bring actual evidence that contradicts all of the data and observations we have.