Comment by re-thc

10 hours ago

> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

GPT can find fault in everything and anything including its own work.

3 comments

re-thc

AI review generally will find fault in anything. Any non-trivial code has multiple solutions with different tradeoffs. Any code can be over-engineered for theoretical edge cases and future use cases you don't need. No matter which solution you pick, you can always say, at a minimum, that some alternative just looks and reads better.

Code is somewhat artistic. If you don't have well-defined standards and priorities, the AI review cycle can spiral indefinitely, figuratively debating what makes art good, and your code will be no better for it.

cmrdporcupine 8 hours ago

This is correct, but I'd say there's something beyond that, more specific to Codex + GPT models. They've done some sort of training that makes it far more diligent about seeking out data races, unhandled errors / negative cases, and missing test coverage than the other models I've played with. It also seems more prone to testing its hypotheses.
This makes it slower to work with for prototyping, and it will, if not properly disciplined, litter your code with "legacy adapters" and "bridge code" and temporary incremental refactoring steps [arguably not terrible for work in real commercial software projects]. And it will create too many unit & integration tests, if you're not careful.
But it does, in my opinion, tend to produce more reliable software and I trust it far more than I did when I was working in Claude.
When I could afford it, I had both plans running: Claude to produce new features, and then Codex to brutally critique it, battle-test it, sharpen the edges, and produce better tests. This flow went extremely well.
Now I just work with Codex and various open models.

cmrdporcupine 9 hours ago

That's what I love about it, and I wish I could find an open model that was as diligent.

Somehow it's just way more careful than the others, and also much better at empirical verification of its hypotheses, writing tests, etc. I am assuming a lot of RL was done on that kind of flow, and on seeking out negative cases, failure points, and race conditions.