Comment by lunar_mycroft

2 hours ago

First, to clarify my own position here: I use LLMs for code review, to help with some planning, for the occasional throwaway prototype, and as a more advanced rubber duck, but I do not let LLMs write code I care about, even with human review (because human review is imperfect).

> its uneven... if its saying it now clearly has superior abilities in many parts of the SWE workflow then its demonstrably right.

In what ways are current LLMs better, excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

> I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we dont need SWEs anymore" point, but I think its a bit of a strawman: who is making the claim you are addressing?

Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

> No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc)

Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are good enough.

> No but then the point I'm making is we're drifting further and further away from Occam's razor.

Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.

aspenmartin 38 minutes ago

> I do not let LLMs write code I care about, even with human review (because human review is imperfect).

That's fine, but you are in a quickly vanishing minority.

> In what ways are current LLMs better excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

Well, this is what I mean by benchmarks and measurement efforts. There are lots of gaps in capabilities, but we've had, say, superhuman competitive programming performance for a while (including on fresh tasks not in training sets); extremely strong performance (super-p90-engineer) on, say, language-to-language porting; RE-bench (an ML research engineering benchmark from METR) is already clearly above human performance; Mythos clearly (unless you believe this is all a massive fraud) has superior cyber capabilities; and so on. Also, why do you discount speed and cost so much?

> Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

Yeah, but what's the basic claim you're referring to here? Every model iteration is a significant bump up in performance according to a lot of complementary and principled measurements. What's the thing that hasn't been true?

> Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are.

First, there are human baselines in plenty of these benchmarks. Second, no one is going to be able to tell you "once SWE-Bench Pro perf numbers get to X we can then refactor our existing process to completely offload task Y to agentic frameworks"; that's a bit of a crazy ask. These numbers are pretty interpretable, and many are pretty robust to things like training set leakage. What would you want to see here?

> Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.

Yet one side has a mountain of hard evidence and the other side has... an outdated n < 20 METR study using Claude 3.5/3.7 Sonnet?