Comment by pscanf

3 hours ago

I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them.

They're incredibly slow (via official API or openrouter), but most of all they seem not to understand the instructions that I give them. I'm sure I'm _holding them wrong_, in the sense that I'm not tailoring my prompt for them, but most other models don't have a problem with the exact same prompt.

Does anybody else have a similar experience?

I ran 5.4 Pro on some data analytics (admittedly it was 300+ pages). It took forever. Ran the same on Sonnet 4.6, night and day difference. I understand it's like using a V8 engine for a V4 task, but I was curious. These new models look promising though. I'd rather use something like a Haiku most of the time over the best rated. I'm not a rocket scientist or solving the mysteries of the universe. They seem to do a great job 80% of the time.

These little 5.4 ones are relatively low latency and fast, which is what I need for voice applications. But they can't quite follow instructions well enough for my task.

That's really the story of my life. Trying to find a smart model with low latency.

Qwen 3.5 9b is almost smart enough and I assume I can run it on a 5090 with very low latency. Almost. So I am thinking I will fine-tune it a little for my application.

Opinions are my own.

For agentic work, both Gemini 3.1 and Opus 4.6 passed the bar for me. I do prefer Opus because my SIs are tuned for that, and I don't want to rewrite them.

But ChatGPT models don't pass the bar. It seems to be trained to be conversational and role-playing. It "acts" like an agent, but it fails to keep the context to really complete the task. It's a bit tiring to always have to double check its work / results.

  • I find both Opus 4.6 and GPT-5.4 have weaknesses but tend to support each other. Someone described it to me jokingly as "Claude has ADHD and Codex is autistic." Claude is great at doing something until it gets done and will run for hours on a task without feedback. Codex is often the opposite: it will ask for feedback often and sometimes just stop in the middle of a task saying it's done with step 1 of 5. On the other hand, Codex is a diligent reviewer and will find even subtle bugs that Claude created in its big long-running "until it's done" work mode.

    • Seems like the diagnoses are backwards, in this case. Claude usually stays on task no matter what, but lately Opus 4.6 is showing signs of overuse. I never used to get overload/internal server error messages, but I've seen about a half-dozen of them today alone. And it has been prone to blowing off subtasks that I'd have expected it to resolve.

Yeah, absolutely. I am using GPT 5.2 / 5.2 Codex with OpenCode and it just doesn't get what I am doing or loses context. Claude on the other hand (via GitHub Copilot) has no problem and also discovers the repository on its own in new sessions, while I basically need to spoonfeed GPT. I also agree on the speed. Earlier today I tasked GPT 5.2 Codex with a small refactor of a task in our codebase with reasoning set to high, and it took 20 minutes to move around 20 files.

I've had such the opposite experience, but mainly doing agentic coding & little chat.

Codex is an ice man. Every other model will have a thinking output that is meaningful and significant, that is walking through its assumptions. Codex outputs only a very basic idea of what it's thinking about, and doesn't verbalize the problem or its constraints at all.

Codex also is by far the most sycophantic model. I am a capable coder, have my charms, but every single direction change I suggest, codex is all: "that's a great idea, and we should totally go that [very different] direction", try as I might to get it to act like more of a peer.

Opus I think does a better job of working with me to figure out what to build, and of understanding the problem. But I find it still has a propensity for making somewhat weird suggestions. I can watch it talk itself into some weird ideas. Which at least I can stop and alter! But I find it's less reliable at kicking out good technical work.

Codex is plenty fast in ChatGPT+. Speed is not the issue. I'm also used to GLM speeds. Having parallel work open, keeping an eye on multiple terminals is just a fact of life now; work needs to optimize itself (organizationally) for parallel workflows if it wants agentic productivity from us.

I have enormous respect for Codex, and think it has (by a significant measure) the best ability to code. In some ways I think maybe some of the reason it's so good is because it's not trying to convey complex dimensional exploration as an understandable human thought sequence. But I resent how you just have to let it work before you have a chance to talk with it and intervene. Even when discussing, it is extremely, extremely terse, and I find I have to ask it again and again and again to expand.

The one caveat I'll add: I've been dabbling elsewhere, but mainly I use OpenCode, and its prompt is pretty extensive and may be part of why Codex feels like an ice man to me. https://github.com/anomalyco/opencode/blob/dev/packages/open...

  • > I've had such the opposite experience

    Yeah, I've actually heard many other people swear by the GPTs / Codex. I wonder what factors make one "click" with a model and not with another.

    > Codex is an ice man.

    That might be because OpenAI hides the actual reasoning traces, showing just a summary (if I understood correctly).

    • OpenClaw guy (he's Austrian, it's relevant) much prefers Codex over Claude and articulated it as being due to Claude's output feeling very "American" and Codex's output feeling very "German", and I personally really agree with the sentiment.

      As an American, Claude feels much more natural to me, with the same overly-optimistic "move fast, break things" ethos that permeates our culture. It takes bigger swings (and misses) at harder-to-quantify concepts than Codex, cuts corners (not intentionally, but it feels like a human who's just moving too fast to see the forest for the trees in the moment), etc. Codex on the other hand feels more grounded, more prone to trying to aggregate blind spots, edge cases, and cover the request more thoroughly than Claude. It's far more pedantic and efficient, almost humorless. The dude also claimed that most of the Codex team is European while the Claude team is American, and suggested that as an influence on why this might be.

      Anyways, I've found that if I force Claude and Codex to talk to each other, I can get way better results and consistency by using Claude to generate fairly good plans from my detailed requests that it passes to Codex for review and amendment, Claude incorporates the feedback and implements the code, then Codex reviews the commit and patches anything Claude misses. Best of both worlds. YMMV
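
The hand-off described above could be sketched roughly as follows. This is only an illustration of the ordering of the steps: `plan_review_implement`, `planner`, and `reviewer` are hypothetical names, and each agent is modeled as a plain function from prompt text to response text rather than a real Claude or Codex client.

```python
from typing import Callable, Dict

# Hypothetical stand-in: an "agent" is any function from a prompt to a reply.
Agent = Callable[[str], str]


def plan_review_implement(request: str, planner: Agent, reviewer: Agent) -> Dict[str, str]:
    """Run the four-step loop: plan, review plan, implement, review commit."""
    # Step 1: the planner (Claude in the comment above) drafts a plan.
    plan = planner(f"Write an implementation plan for:\n{request}")
    # Step 2: the reviewer (Codex in the comment above) reviews and amends it.
    plan_feedback = reviewer(f"Review and amend this plan:\n{plan}")
    # Step 3: the planner incorporates the feedback and implements the code.
    implementation = planner(
        "Implement this plan, incorporating the feedback.\n"
        f"Plan:\n{plan}\nFeedback:\n{plan_feedback}"
    )
    # Step 4: the reviewer checks the result and patches anything missed.
    final_review = reviewer(f"Review this commit and patch anything missed:\n{implementation}")
    return {
        "plan": plan,
        "plan_feedback": plan_feedback,
        "implementation": implementation,
        "final_review": final_review,
    }
```

In practice each callable would wrap whatever CLI or API call you already use for that model; the sketch only fixes who speaks when.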

Same, and I can't put my finger on the "why" either. Plus I keep hitting guard rails for the strangest reasons, like telling codex "Add code signing to this build pipeline, use the pipeline at ~/myotherproject as reference" and codex tells me "You should not copy other people's code signing keys, I can't help you with this"

Are you requesting reasoning via param? Not doing so was a mistake I was making. However, with the highest reasoning level I would frequently encounter a cyber security violation when using an agent that self-modifies.

I prefer Claude models as well, or open models, for this reason, except that the Codex subscription comes with a pretty hefty token allowance.

  • Yes, I think? But I was talking more specifically about using the models via API in agents I develop, not for agentic coding. Though, thinking about it, I also don't click with the GPT models when I use them for coding (using Codex). They just seem "off" compared to Claude.

    • I like GPT models in Codex, for a fully vibecoded experience (I don't look at code) for my side-projects. In there, they really get the job done: you plan, they say what they'll do, and it shows up done. It's rare I need to push back and point out bugs. I really can't fault them for this very specific use-case.

      For anything else, I can't stand them, and it genuinely feels like I am interacting with different models outside of codex:

      - They act like terribly arrogant agents. It's just in the way they talk: self-assured, assertive. They don't say they think something, they say it is so. They don't really propose something, they say they're going to do it because it's right.

      - If you counter them, their thinking traces are filled with what is virtually identical to: "I must control myself and speak plainly, this human is out of his fucking mind"

      - They are slow. Measurably slow. Sonnet is so much faster. With Sonnet models, I can read every token as it comes, but it takes some focusing. With GPT, I can read the whole trace in real-time without any effort. It genuinely gives off this "dumb machine that can't follow me" vibe.

      - Paradoxically, even though they are so full of themselves, they insist upon checking things which are obvious. They will say "The fix is to move this bit of code over there [it isn't]" and then immediately start looking at sort of random files to check...what exactly?

      - I feel they make perhaps as many mistakes as Sonnet, but their mistakes are much less predictable. The kind that leaves me baffled. This doesn't have to be bad for code quality: Sonnet makes mistakes which _might_ at points even be _harder_ to catch, so might be easier to let slip by. Yet, it just imprints this feeling of distrust in the model, which is counter-productive to making me want to come back to it.

      I didn't compare either with Gemini because Gemini is a joke that "does", and never says what it is "doing", except when it does so by leaving thinking traces in the middle of python code comments. Love my codebase to have "But wait, ..." in the middle of it. A useless model.

      I've recently started saying this:

      - Anthropic models feel like someone of that level of intelligence thinking through problems and solving them. Sonnet is not Opus -- it is Sonnet-level intelligence, and shows it. It approaches problems in a sensible, reasonably predictable way.

      - Gemini models feel like a cover for a bunch of inferior developers all cluelessly throwing shit at the wall and seeing what sticks -- yet, ultimately, they only show the final decision. Almost like you're paying a fraudulent agency that doesn't reveal its methods. The thinking is nonsensical and all over the place, and it does eventually achieve some of its goals, but you can't understand what little it shows other than "Running command X" and "Doing Y".

      On a final note: when building agentic applications, I used to prefer GPT (a year ago), but I can't stand it now. Robotic, mechanical, constantly misusing tools. I reach for Sonnet/Opus if I want competence and adherence to prompt, coupled with an impeccable use of tools. I reach for Gemini (mostly flash models) if I want an acceptable experience at a fraction of the price and latency.

    • I am also talking about agents I'm developing. They just happen to be self-modifying but they're not for agentic coding. You have to explicitly send the reasoning effort parameter. If you set effort to None (default for gpt-5.4) you get very low intelligence.
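
For anyone hitting the same thing, one way to avoid silently falling back to the default is to make the effort a required argument when building the request. A minimal sketch, with assumptions: `build_request` is a hypothetical helper, and the payload shape (a `reasoning` object with an `effort` field) is modeled on how I understand OpenAI's Responses API to accept it, so check the current API reference for the exact field names and allowed values before relying on it.

```python
from typing import Any, Dict


def build_request(model: str, prompt: str, *, effort: str) -> Dict[str, Any]:
    """Build a request body that always carries an explicit reasoning effort.

    `effort` is keyword-only and has no default, so a caller cannot forget it.
    The field names below are an assumption about the API's payload shape.
    """
    if not effort:
        raise ValueError("reasoning effort must be set explicitly")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
    }


# Example: the resulting dict would be passed to whatever HTTP client or SDK you use.
payload = build_request("gpt-5.4", "Summarize the open tasks.", effort="high")
```

The helper does no network call; it only guarantees that the effort setting is present in every request your agent sends.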

  • > cyber security violation

    Would you mind expanding on this? Do you mean in the resulting code? Or a security problem on your local machine?

    I naively use models via our Copilot subscription for small coding tasks, but haven't gone too deep. So this kind of threat model is new to me.