Comment by andreagrandi
3 days ago
I'm only waiting for OpenAI to provide an equivalent ~100 USD subscription to entirely ditch Claude.
Opus has gone downhill continuously in the last week (and before you start flooding me with replies, I've been testing Opus/Codex in parallel for the last week, and I have plenty of examples of Claude going off track, then apologising, then saying "now it's all fixed!" and then only fixing part of it, while Codex nailed it on the first shot).
I can accept specific model limits, not ups and downs in reliability. And don't even get me started on how bad the Claude client has become. Others are finally catching up, and gpt-5.3-codex is definitely better than opus-4.6.
Everyone else (Codex CLI, Copilot CLI, etc.) is going open source, while they are going closed. Others (OpenAI, Copilot, etc.) explicitly allow using OpenCode; they explicitly forbid it.
This hostile behaviour is just the last straw.
OpenAI forces users to verify with an ID + face scan when using Codex 5.3 if any of their conversations is deemed high risk.
It seems like they currently have a lot of false positives: https://github.com/openai/codex/issues?q=High%20risk
They haven't asked me yet (my subscription is from work, with a business/team plan). Probably my conversations are too boring.
Try something not boring and see what happens?
I’m unsure exactly in what way you believe it has gone “downhill”, so this isn’t aimed at you specifically but at a more general pattern I see.
That pattern is people complaining that a particular model's response quality has degraded over time, or that it has been “nerfed”, etc.
Although the models may evolve, and the tools calling them may change, I suspect a huge amount of this is simply confirmation bias.
> Opus has gone downhill continuously in the last week
Is a week the entire attention span of the late 2020s?
We’re still in the mid-late 2020s. Once we really get to the late 2020s, attention spans won’t be long enough to even finish reading your comment. People will be speaking (not typing) to LLMs and getting distracted mid-sentence.
Reminds me of https://www.baen.com/Chapters/9781618249203/9781618249203___...
Seems we're already there.
My brain trailed off after "won’t be long enough to even finish"...
I would even call it mid 2020s. I think in a couple years people's attention spans will be so short they won't even finish reading comments.
Unfortunately, and “Attention Is All You Need”.
oh shit we're in the late 2020's now
Sorry, I don’t agree. And I won’t be taking questions at this time.
Opus 4.6 genuinely seems worse to me than 4.5 was in Q4 2025. I know everyone always says this, and anecdote != data, but this is the first time I've really felt it with a new model, to the point where I still reach for the old one.
I'll give GPT 5.3 codex a real try I think
Huh… I’ve seen this comment a lot in this thread but I’ve really been impressed with both Anthropic’s latest models and latest tooling (plugins like /frontend-design mean it actually designs real front ends instead of the vibe coded purple gradient look). And I see it doing more planning and making fewer mistakes than before. I have to do far less oversight and debugging broken code these days.
But if people really like Codex better, maybe I’ll try it. I’ve been trying not to pay for 2 subscriptions at once but it might be worth a test.
> And I see it doing more planning and making fewer mistakes than before
Anecdotally, maybe this is the reason? It does seem to spend a lot more time “thinking” before giving what feels like equivalent results, most of the time.
Probably eats into the gambling-style adrenaline cycles.
I asked Codex 5.3 and Opus 4.6 to write me a macOS application with a certain set of requirements.
Opus 4.6 wrote me a working macOS application.
Codex wrote me an HTML + CSS mockup of a macOS application that didn't even look like a macOS application at all.
Opus 4.5 was fine, but I feel that 4.6 is more often on the money with its implementations than 4.5 was. It is just slower.
Codex has written me 3 very nice macOS applications in the past week lol
I asked both to help me with a hardware bug. Codex kept trying things, sure of what the problem was every time, and every time making it worse.
Opus went off and browsed my dependencies for ten minutes, then came back and solved the problem first try.
Literally a skill issue.
I agree with you. Codex 5.3 is good; it's just a bit slower.
It is (slower), especially at the xhigh setting. But if I have to redo things three times and keep confirming trivial stuff (Claude Code seems to keep changing the commands it uses to read code... once it uses "bash-read", once it uses "tree", once it uses "head", and I have to keep confirming permission), I definitely waste more time than I would by giving a command to Codex (or in my case OpenCode + the Codex model) and coming back after 10 minutes.
The rate limit for my $20 OpenAI / Codex account feels 10x larger than that of the $20 Claude account.
YES. I hit the rate limit in about 15 mins on Claude, but it takes me a few hours with Codex. A/B testing them on the same tasks. Same $20/mo.
I was underwhelmed by Opus 4.6. I didn’t get a sense of significant improvement, but the token usage was excessive to the point that I dropped the subscription in favor of Codex. I suspect that all the models are so glib that they can create a quagmire for themselves in a project. I have not yet found a satisfying strategy for non-destructive resets when the system's own comments and notes poison new output. Fortunately, deleting and starting over is cheap.
No offense, but this is the most predictable outcome ever. The software industry at large does this over and over again, and somehow we're surprised. Provide a thing for free or for cheap, then slowly draw back availability once you have dominant market share or find yourself needing money (ahem).
The providers want to control what AI does in order to make money or dominate an industry, so they don't have to make their money back right away. This was inevitable; I do not understand why we ever trust these companies.
Because it's easier than paying $50k for a local LLM setup that might not last 5 years.
Well, yes. They know what they are doing. They know that, given the option, the consumer makes the affordable choice. I just don't have to like or condone their practices. Maybe instead of taking on billions of dollars of debt they should have thought about a business model that makes sense first? Maybe the collective "we" (consumers and investors, but especially investors) should keep it in our pants until the product is proven and sustainable?
It will be really interesting if the haters are right and this technology is not the breakthrough the investors assume it to be AFTER it is already woven into everyone's workflows. Everyone keeps talking about how jobs will be displaced, yet few are asking what happens when a dependency is swept out from underneath the industry as a whole if/when this massive gamble doesn't pay off.
Whatever. I am squawking into the void as we just repeat history.
No offense taken here :)
First, we are not talking about a cheap service here. We are talking about a subscription that costs 100 or 200 USD per month, depending on which plan you choose.
Second, it's like selling me a pizza and expecting me to only eat it while sitting at your table. I want to eat the pizza at home. I'm not getting 2-3 extra pizzas; I'm still getting the same pizza everyone else is getting.
It's the most overrated model there is. I do Elixir development primarily, and the model sucks balls in comparison to Gemini and GPT-5x. But the Claude fanboys will swear by it and will attack you if you ever say even something remotely negative about their "god-sent" model. It fails miserably even in basic chat and research contexts and constantly goes off track. I wired it up to fire off some tasks. It kept hallucinating and swearing it had done them when it hadn't even attempted to. It was so unreliable I had to revert to Gemini.
It might simply be that it was not trained enough in Elixir RL environments compared to Gemini and GPT. I use it for both TS and Python, and it's certainly better than Gemini. Compared to Codex, it depends on the task.
> I’m only waiting for OpenAI to provide an equivalent ~100 USD subscription to entirely ditch Claude.
I have a feeling Anthropic might be in for an extremely rude awakening when that happens, and I don’t think it’s a matter of “if” anymore.
> And don't even get me started on how bad the Claude client has become
The latest versions of Claude Code have been freezing and then crashing while waiting on long-running commands. It's pretty frustrating.
My favorite conspiracy explanation:
Claude has gotten a lot of popular media attention in the last few weeks, and the influx of users is constraining compute/memory on an already compute heavy model. So you get all the suspected "tricks" like quantization, shorter thinking, KV cache optimizations.
It feels like the same thing that happened to Gemini 3, and what you can even feel throughout the day (the models seem smartest at 12am).
In his interview with Dwarkesh last week, Dario also repeated the same refrain as other lab leaders: compute is constrained and there are big tradeoffs in how you allocate it. It feels safe to reason, then, that they will use any trick they can to free up compute.
No developer writes the same prompt twice. How can you be sure something has changed?
I regularly run the same prompts twice and through different models, particularly when making changes to agent metadata like agent files or skills.
At least weekly I run a set of prompts to compare Codex and Claude against each other. This is quite easy: the prompt sessions are just text files that are saved.
The problem is doing it enough times for statistical significance, and judging whether the output is actually better or not.
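For anyone curious, the harness is nothing fancy; here's a minimal sketch of the idea in Python. The exact CLI invocations (`claude -p`, `codex exec`) and the folder layout are assumptions about the non-interactive modes, so adjust them for whatever your installed versions actually accept.

```python
#!/usr/bin/env python3
"""Run the same saved prompt files through two CLI agents and keep the outputs
side by side for later comparison.

Assumptions (not from the comment above): prompts live in ./prompts as .txt
files, and the agents can be driven non-interactively via `claude -p` and
`codex exec`. Adjust both to match your installed tools.
"""
import datetime
import pathlib
import subprocess

PROMPT_DIR = pathlib.Path("prompts")  # one prompt per .txt file
OUT_DIR = pathlib.Path("runs") / datetime.date.today().isoformat()

# Assumed non-interactive invocations; the prompt is appended as the last argument.
AGENTS = {
    "claude": ["claude", "-p"],
    "codex": ["codex", "exec"],
}


def main() -> None:
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for prompt_file in sorted(PROMPT_DIR.glob("*.txt")):
        prompt = prompt_file.read_text()
        for name, base_cmd in AGENTS.items():
            # Capture stdout so each (prompt, agent) pair gets its own output file.
            result = subprocess.run(base_cmd + [prompt], capture_output=True, text=True)
            out_path = OUT_DIR / f"{prompt_file.stem}.{name}.md"
            out_path.write_text(result.stdout)
            print(f"{prompt_file.name} -> {name}: {len(result.stdout)} chars")


if __name__ == "__main__":
    main()
```

Diffing the dated runs/ folders against each other is then the easy part; judging which output is actually better is the hard part I mentioned.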
I suspect you may not be writing code regularly... If I have to ask Claude the same things three times and it keeps saying "You are right, now I've implemented it!" and the code is still missing 1 out of 3 things or worse, then I can definitely say the model has become worse (since this wasn't happening before).
> I suspect you may not be writing code regularly...
You have no reason to suspect this.
What model were you using where this wasn't happening before?
When I use Claude daily (both professionally and personally, with a Max subscription), there are things that it does differently between 4.5 and 4.6. It's hard to point to any single conversation, but in aggregate I'm finding that certain tasks don't go as smoothly as they used to. In my view, Opus 4.6 is a lot better at long-running conversations (which has value), but does worse with critical details within smaller conversations.
A few things I've noticed:
* 4.6 doesn't look at certain files that it used to
* 4.6 tends to jump into writing code before it's fully understood the problem (annoying but promptable)
* 4.6 is less likely to do research, write to artifacts, or make external tool calls unless you specifically ask it to
* 4.6 is much more likely to ask annoying (blocking) questions that it could reasonably figure out on its own
* 4.6 is much more likely to miss a critical detail in a planning document after being explicitly told to plan for that detail
* 4.6 needs to more proactively write its memories to file within a conversation to avoid going off track
* 4.6 is a lot worse about demonstrating critical details. I'm so tired of it explaining something conceptually without thinking about how to implement the details.
Just hit a situation where 4.6 is driving me crazy.
I'm working through a refactor and I explicitly told it to use a block (as in Ruby blocks), and it completely overlooked that. Totally missed it as something I asked it to do.
Ralph Wiggum would like a word
Same prompt assumes same context state. But I think you get what I mean.
All this because of a single week?
No, it's not the first time their models have degraded for a stretch of time.