Comment by mohsen1

6 days ago

Yes indeed. More than 1 million lines of code (including tests) means jumping through lots of hoops, but with LLMs it's not as painful; you can just ask them to do the hard things.

Example of a Claude Code session that came out without results after 2 hours of "Crunching": https://github.com/mohsen1/tsz/pull/4868 (Edit: I force-pushed to the PR to solve the problem; you can see the initial refusal message in the first version of the PR description.)

Funny thing is, the last percent of the tests has been so hard to work on that Opus 4.7 routinely bails and says "it's too involved or complicated", so I had to add prompts specifically asking it not to bail.

You should try GPT; I'd be really interested to hear if it works better. (I use GPT exclusively for systems work at $DAYJOB, but I compare with Opus every couple of weeks, and GPT consistently gives me better results.)

  • I've been comparing Claude vs Codex using GPT, and Claude is consistently better than GPT at reasoning, at writing code, and at using tools appropriately.

    GPT, for instance, had a lot of issues with git worktrees and didn't understand how to use them correctly to merge work back into the main branch, whereas Claude seems to do this much more naturally.

    GPT also left me with broken tests and code that I had to iterate on manually; Claude is much better at reasoning through code. This is primarily Python.

    • > GPT, for instance, had a lot of issues with git worktrees and didn't understand how to use them correctly to merge work back into the main branch, whereas Claude seems to do this much more naturally.

      I wonder how much of that is due to the model itself being better, versus the harness having built-in instructions on how to use them.

      I've used worktrees with Codex just fine, but I instructed it to use my scripts for setting them up and tearing them down. The scripts also reflinked existing compilation artifacts into the worktree to speed up compiling and allocated a fresh db instance for it, and they applied a simple protocol for locking the master repository during merges, so multiple agents wouldn't try to merge at the same time. It has been following those instructions quite well.
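
      A minimal sketch of what such scripts might look like. The original scripts aren't shown, so all names and paths here are hypothetical; this assumes Linux with `flock(1)` and a filesystem that supports `cp --reflink` (e.g. btrfs or XFS):

      ```shell
      #!/usr/bin/env bash
      # Hypothetical sketch of per-agent worktree setup/merge/teardown.
      # MAIN_REPO, the target/ artifact dir, and the lock file path are
      # illustrative assumptions, not the commenter's actual scripts.
      set -euo pipefail

      MAIN_REPO="$HOME/src/project"
      LOCK_FILE="$MAIN_REPO/.merge.lock"

      setup_worktree() {
        local branch="$1" dir="$MAIN_REPO/../wt-$branch"
        git -C "$MAIN_REPO" worktree add -b "$branch" "$dir"
        # Reflink existing build artifacts into the new worktree so the
        # agent's first build is incremental instead of from scratch.
        cp -r --reflink=auto "$MAIN_REPO/target" "$dir/target" 2>/dev/null || true
      }

      merge_back() {
        local branch="$1"
        # Serialize merges: take an exclusive lock on the master repo so
        # concurrent agents can't merge at the same time.
        (
          flock -x 9
          git -C "$MAIN_REPO" merge --no-ff "$branch"
        ) 9>"$LOCK_FILE"
      }

      teardown_worktree() {
        local branch="$1" dir="$MAIN_REPO/../wt-$branch"
        git -C "$MAIN_REPO" worktree remove --force "$dir"
        git -C "$MAIN_REPO" branch -D "$branch"
      }
      ```

      The point of the `flock` subshell is that the lock is released automatically when the merge finishes or fails, so a crashed agent can't wedge the master repository.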

  • OpenAI gave me that 10x boost, and I've already used it all for this week. I'm guessing the last 50 tests are only doable by GPT 5.5 xhigh.

That might be Opus 4.7 behaviour, because I've also been getting that all the time in the past few weeks. Also a complex code base, though likely an order of magnitude simpler than yours.