Comment by wahnfrieden

7 months ago

If you switch to Codex you will get a lot of tokens for $200, enough to more consistently use high reasoning as well. Cursor is simply far more expensive so you end up using less or using dumber models.

Claude Code is overrated: many of its features and modalities exist to compensate for model shortcomings, and they are not as necessary for steering state-of-the-art models like GPT 5.2.

14 comments

MrOrelliOReilly 7 months ago

I think this is a total misunderstanding of Anthropic’s place in the AI race. Opus 4.5 is absolutely a state of the art model. I won’t knock anyone for preferring Codex, but I think you’re ignoring official and unofficial benchmarks.

See: https://artificialanalysis.ai

wahnfrieden 7 months ago

What am I missing? As suspicious as benchmarks are, your link shows GPT 5.2 to be superior.
It is also out of date as it does not include 5.2 Codex.
Per my point about poor steerability being compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% on the instruction-following benchmark in your link! Thanks for the hard evidence - GPT 5.2 is roughly 30% ahead of Opus 4.5 there in relative terms. No wonder Claude Code needs those harness features for the user to manually rein in its instruction following.
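For anyone checking the "30%" figure: it is a relative difference, not a gap in percentage points. A minimal sketch of the arithmetic, using only the 58% and 75% scores quoted above (the function name `compare_scores` is made up for illustration and is not from any benchmark tooling):

```python
def compare_scores(higher: float, lower: float) -> dict:
    """Express the gap between two benchmark scores three different ways."""
    gap_points = higher - lower
    return {
        # Absolute gap, in percentage points.
        "percentage_points": gap_points,
        # How far the higher score is ahead, relative to the lower score.
        "relative_gain_pct": gap_points / lower * 100,
        # How far the lower score is behind, relative to the higher score.
        "relative_drop_pct": gap_points / higher * 100,
    }


# Scores as quoted in the comment above: GPT 5.2 at 75%, Opus 4.5 at 58%.
result = compare_scores(75.0, 58.0)
print(f"gap: {result['percentage_points']:.0f} percentage points")
print(f"higher score is ahead by {result['relative_gain_pct']:.1f}% (relative)")
print(f"lower score is behind by {result['relative_drop_pct']:.1f}% (relative)")
```

So "about 30% ahead" holds when measured from the lower score, while the same gap is 17 percentage points, or roughly 23% behind when measured from the higher score.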

woadwarrior01 7 months ago

> Opus 4.5 is absolutely a state of the art model.
> See: https://artificialanalysis.ai

The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
- MrOrelliOReilly 7 months ago
  
  Totally; however, OP's point was that Claude had to compensate for deficiencies versus a state-of-the-art model like GPT 5.2. I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" here to narrowly mean #1 on a given benchmark, but rather to mean at or near the frontier of current capabilities.
- gessha 7 months ago
  
  One thing to remember when comparing ML models of any kind is that single-value metrics obscure a lot of nuance, and you really have to go through the model results one by one to see how each model performs. This is true for vision, NLP, and other modalities.
- ramoz 7 months ago
  
  https://x.com/giansegato/status/2002203155262812529/photo/1
  https://x.com/METR_Evals/status/2002203627377574113
  > Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
  What an insane take for anybody who uses these models daily.
  
- dr_dshiv 7 months ago
  
  https://lmarena.ai/leaderboard/webdev
  LM Arena shows Claude Opus 4.5 on top
  
- fzzzy 7 months ago
  
  Is x-high fast enough to use as a coding agent?
  

ccmcarey 7 months ago

I disagree. The Claude models seem the best at tool calling, Opus 4.5 seems the smartest, and Claude Code (+ Claude model) seems to make good use of subagents and planning in a way that Codex doesn't.

wahnfrieden 7 months ago

Opus 4.5 is so bad at instruction following (58% vs. 75% on the benchmark shared above) that it requires a manual toggle for plan mode.
GPT 5.2 simply obeys an instruction to assemble a plan, which avoids the need to compensate for poor steerability by having the user manually manage modalities.
Opus has improved, though, so plan mode is less necessary than it was before, but it is still far behind state-of-the-art steerability.