Comment by petcat

6 hours ago

I have found that Claude Opus 4.6 is a better reviewer than it is an implementer. I switch off between Claude/Opus and Codex/GPT-5.4 doing reviews and implementations, and invariably Codex ends up having to do multiple rounds of reviews and requesting fixes before Claude finally gets it right (and then I review). When it is the other way around (Codex impl, Claude review), it's usually just one round of fixes after the review.

So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.

9 comments

petcat

ivanech 6 hours ago

Hmm in my experience (I've done a lot of head-to-heads), Opus 4.6 is a weaker reviewer than GPT 5.4 xhigh. 5.4 xhigh gives very deep, very high-signal reviews and catches serious bugs much more reliably. I think it's possible you're observing Opus 4.6's higher baseline acceptance rate instead of GPT 5.4's higher implementation quality bar.

parasti 5 hours ago

This is also my experience using both via Augment Code. Never understood what my colleagues see in Claude Opus, GPT plans/deep dives are miles ahead of what Opus produces - code comprehension, code architecture is unmatched really. I do use Sonnet for implementation/iteration speed after seeding context with GPT.
egeozcan 5 hours ago

I agree. Opus, forget the plan mode - even when using superpowers skill, leaves a lot of stuff dangling after so many review rounds.
Along with claude max, I have a chatgpt pro plan and I find it a life-saver to catch all the silliness opus spits out.
jonnycoder 5 hours ago

I agree, I use codex 5.4 xhigh as my reviewer and it catches major issues with Opus 4.6 implementation plans. I'm pretty close to switching to codex because of how inconsistent claude code has become.
petcat 6 hours ago
Maybe it's all just anecdotal then. Everyone is having different experiences.
Maybe we're being A/B tested.
- femiagbabiaka 6 hours ago
  
  The experience one has with this stuff is heavily influenced by overall load and uptime of Anthopic's inference infra itself. The publicly reported availability of the service is one 9, that says nothing of QoS SLO numbers, which I would guess are lower. It is impossible to have a consistent CX under these conditions.

landonxjames 6 hours ago

I have noticed this as well. I frequently have to tell it that we need to do the correct fix (and then describe it in detail) rather than the simple fix. And even then it continues trying to revert to the simple (and often incorrect) fix.

nrds 5 hours ago

You have to throw the context away at that point. I've experienced the same thing and I found that even when I apparently talk Claude into the better version it will silently include as many aspects of the quick fix as it thinks it can get away with.

enraged_camel 5 hours ago

I have a similar workflow but I disagree with Codex/GPT-5.4 reviews being very useful. For example, in a lot of cases they suggest over-engineering by handling edge cases that won't realistically happen.