Comment by bencyoung
20 days ago
Some example PRs if people want to look:
https://github.com/dotnet/runtime/pull/115733 https://github.com/dotnet/runtime/pull/115732 https://github.com/dotnet/runtime/pull/115762
That first PR (115733) would make me quit after a week if we were to implement this crap at my job and someone forced me to babysit an AI in its PRs in this fashion. The others are also rough.
A wall of noise that tells you nothing of substance, delivered in an authoritative tone as if what it's doing is objective and truthful. Immediately followed by:
- The 8 actual lines of code it wrote to fix the issue (discounting the tests & boilerplate) are being questioned by the person reviewing the code; it seems he's not convinced this is actually fixing what it should be fixing.
- Not running the "comprehensive" regression tests at all
- When they do run, they fail
- When they get "fixed" oh-so confidently, they still fail. Fifty-nine failing checks. Some of these tests take upward of an hour to run.
So the reviewer here has to read all the generated slop in the PR description and try to grok what the PR is about, read through the changes himself anyway (thankfully it's only a ~50 line diff in this situation, but imagine if this were a large refactor of some sort with a dozen files changed), and then drag it by the hand multiple times to try to fix issues it itself is causing. All the while you have to tag the AI as if it's another colleague and talk to it as if it's not just going to spit out whatever inane bullshit it thinks you want to hear based on the question asked. Test failed? Well, tests fixed! (No, they weren't.)
And we're supposed to be excited about having this crap thrust on us, with clueless managers being sold on this being a replacement for an actual dev? We're being told this is what peak efficiency looks like?
Thanks, that’s really interesting to see, especially the exchange around whether something is the problem or the symptom, where the confident tone belies the lack of understanding. As an open source maintainer I wonder about the best way to limit usage to cases where someone has time to spend on those interactions.
Seems amazingly similar to the changes a junior would make at the moment (jumping to the solution that "fixes" it in the most shallow way).
That first PR is rough. Why does it have to wait for a comment to fix failing tests?
lol, those first two… poor Stephen
Thanks. I wonder what model they're using under the hood? I have such a good experience working with Cline and Claude Sonnet 3.7 and a comparatively much worse time with anything GitHub offers. These PRs are pretty consistent with the experience I've had in the IDE too. Incidentally, what has MSFT done to Claude Sonnet 3.7 in VSCode? It's like they lobotomized it compared to using it through Cline or the API directly. Trying to save on tokens or something?