Comment by amluto
2 days ago
Is it?
This week I asked GPT-5.2 to debug an assertion failure in some code that worked with one compiler but failed with a different one. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why that was reasonable, but the new assertion didn’t actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don’t really think it found in its training set, that it wasn’t wrong.
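A contrived C++ sketch of the kind of rewrite I mean (not the actual code, just the shape of it):

    // Contrived illustration: a rewrite that looks equivalent but
    // silently weakens the invariant being checked.
    #include <cassert>
    #include <cstddef>

    // Original check: the buffer must be exactly full at this point.
    void check_original(std::size_t len, std::size_t cap) {
        assert(len == cap);
    }

    // The "fixed" check: plausible-sounding, but it now also passes
    // for any partially filled buffer, so it no longer catches the
    // state the original assertion guarded against.
    void check_rewritten(std::size_t len, std::size_t cap) {
        assert(len <= cap);
    }

    int main() {
        check_rewritten(3, 4); // passes
        check_original(3, 4);  // fires, exposing the real bug
    }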
I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.
I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.
This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.
They all do this at some point. Claude loves to delete tests that are failing if it can't fix them, or to delete code that won't compile if it can't figure it out.
Huh. A while back I gave up trying to get Claude Code to cheat the ridiculous Home Assistant pre-run integration checklist so I could run some under-development code, and I ended up doing it myself.