Comment by gklitt

8 months ago

I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.

Claude Code did great and wrote pretty decent docs.

Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.

I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.

I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.

44 comments

gklitt

strangescript 8 months ago

Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.

I feel like people are sleeping on Claude Code for one reason or another. Its not cheap, but its by far the best, most consistent experience I have had.

artdigital 8 months ago
Claude Code is just way too expensive.
These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.
- aitchnyu 8 months ago
  
  Is it using one of these models? https://openrouter.ai/models?q=amazon
  Seems 4x costlier than my Aider+Openrouter. Since I'm less about vibes or huge refactoring, my (first and only) bill is <5 usd with Gemini. These models will halve that.
  
  6 replies →
- monsieurbanana 8 months ago
  
  > Upgrade apps in a fraction of the time with the Amazon Q Developer Agent for code transformation (limit 4,000 lines of submitted code per month)
  4k loc per month seems terribly low? Any request I make could easily go over that. I feel like I'm completely misunderstanding (their fault though) what they actually meant.
  Edit: No I don't think I'm misunderstanding, if you want to go over this they direct you to a pay-per-request plan and you are not capped at $20 anymore
  
  3 replies →
ekabod 8 months ago
"gemini 2.5 pro exp" is superior to Claude Sonnet 3.7 when I use it with Aider [1]. And it is free (with some high limit).
[1]https://aider.chat/
- razemio 8 months ago
  
  Compared to cline aider had no chance, the last time I tried it (4 month ago). Has it really changed? Always thought cline is superior because it focuses on sonnet with all its bells an whistles. While aider tries to be an universal ide coding agent which works well with all models.
  When I try gemmini 2.5 pro exp with cline it does very well but often fails to use the tools provided by cline which makes it way less expensive while failing random basic tasks sonnet does in its sleep. I pay the extra to save the time.
  Do not get me wrong. Maybe I am totally outdated with my opinion. It is hard to keep up these days.
  
  2 replies →
- jacooper 8 months ago
  
  Don't they train on your inputs if you use the free Ai studio api key?
  
  1 reply →
- strangescript 8 months ago
  
  I would use Aider if it had an agent mode. It needs to catch up with UX, frankly just have a mode that copies what claude code does.
Aeolun 8 months ago
> Its not cheap, but its by far the best, most consistent experience I have had.
It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.
- jasonjmcghee 8 months ago
  
  If you get a hang of controlling costs, it's much cheaper. If you're exhausting the context window, I'm not surprised you're seeing high cost.
  Be aware of the "cache".
  Tell it to read specific files, never use /compact (that'll bust cache, if you need to, you're going back and forth too much or using too many files at once).
  Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT.
  Have a clear goal in mind and keep sessions to as few messages as possible.
  Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
  I'm at about ~$0.5-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).
  If i need to feed it a ton of docs etc. for some task, it'll be more in the few $, rather than < $1. But I really only do this to try some prototype with a library claude doesn't know about (or is outdated).
  For hobby stuff, it adds up - totally.
  For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).
- Implicated 8 months ago
  
  I keep seeing this sentiment and it's wild to me.
  Sure, it might cost a few dollars here and there. But what I've personally been getting from it, for that cost, is so far away from "expensive" it's laughable.
  Not only does it do things I don't want to do, in a _super_ efficient manner. It does things I don't know how to do - contextually, within my own project, such that when it's done I _do_ know how to do it.
  Like others have said - if you're exhausting the context window, the problem is you, not the tool.
  Example, I have a project where I've been particularly lazy and there's a handful of models that are _huge_. I know better than to have Claude read those models into context - that would be stupid. Rather - I tell it specifically what I want to do within those models, give it specific method names and tell it not to read the whole file, rather search for and read the area around the method definition.
  If you _do_ need it to work with very large files - they probably shouldn't be that large and you're likely better off refactoring those files (with Claude, of course) to abstract out where you can and reduce the line count. Or, if anything, literally just temporarily remove a bunch of code from the huge files that isn't relevant to the task so that when it reads it it doesn't have to pull all of that into context. (ie: Copy/paste the file into a backup location, delete a bunch of unrelated stuff in the working file, do your work with claude then 'merge' the changes to the backup file and copy it back)
  If a few dollars here and there for getting tasks done is "too expensive" you're using it wrong. The amount of time I'm saving for those dollars is worth many times the cost and the number of times that I've gotten unsatisfactory results from that spending has been less than 5.
  I see the same replies to these same complaints everywhere - people complaining about how it's too expensive or becomes useless with a full context. Those replies all state the same thing - if you're filling the context, you've already screwed it up. (And also, that's why it's so expensive)
  I'll agree with sibling commenters - have claude build documentation within the project as you go. Try to keep tasks silo'd - get in, get the thing done, document it and get out. Start a new task. (This is dependent on context - if you have to load up the context to get the task done, you're incentivized to keep going rather than dump and reload with a new task/session, thus paying the context tax again - but you also are going to get less great results... so, lesson here... minimize context.)
  100% of the time that I've gotten bad results/gone in circles/gotten hallucinations was when I loaded up the context or got lazy and didn't want to start new sessions after finishing a task and just kept moving into new tasks. If I even _see_ that little indicator on the bottom right about how much context is available before auto-compact I know I'm getting less-good functionality and I need to be careful about what I even trust it's saying.
  It's not going to build your entire app in a single session/context window. Cut down your tasks into smaller pieces, be concise.
  It's a skill problem. Not the tool.
  
  13 replies →

ilaksh 8 months ago

Did you try the same exact test with o3 instead? The mini models are meant for speed.

gklitt 8 months ago

I want to but I’ve been having trouble getting o3 to work - lots of errors related to model selection.

ksec 8 months ago

Sometimes I see in certain areas AI / LLM is absolutely crushing those jobs, a whole category will be gone in next 5 to 10 years as they are already 80 - 90% mark. They just need another 5 - 10% as they continue to get improvement and they are already cheaper per task.

Sometimes I see an area of AI/LLM that I thought even with 10x efficiency improvement and 10x hardware resources which is 100x in aggregate it will still be no where near good enough.

The truth is probably somewhere in the middle. Which is why I dont believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment and continue for another 10 years before hopefully another breakthrough.

enether 8 months ago

there was one post that detailed how those OpenAI models hallucinate and double down on thier mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case

I wonder if this is what's causing it to do badly in these cases

victor9000 8 months ago

> I no longer have the “real” prime I generated during that earlier session... I produced it in a throw‑away Python process, verified it, copied it to the clipboard, and then closed the interpreter.
AGI may well be on its way, as the mode is mastering the fine art of bullshitting.

kristopolous 8 months ago

Ever use Komment? They've been in the game a awhile. Looks pretty good