Comment by Frannky

2 hours ago

Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter.

I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time. Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.

It was also the first AI I felt, "Damn, this thing is smarter than me."

The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.

I wish I had this kind of experience. I threw a tedious but straightforward task at Claude Code using Opus 4.6 late last week: find the places in a React code base where we were using useState and useEffect to calculate a value that was purely dependent on the inputs to useEffect, and replace them with useMemo. I told it to be careful to only replace cases where the change did not introduce any behavior changes, and I put it in plan mode first.

It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.

It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.

I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.

It probably still saved time compared to making all the changes myself. But it was way more frustrating.
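The failure described above (a `const x = useMemo(...)` placed below code that already reads `x`) is the kind of thing a mechanical check can catch before review. Below is a rough sketch of such a check, not a real linter: it assumes one component per snippet, declarations written exactly as `const name = useMemo(`, and it does not skip comments or strings. The function name `find_use_before_memo` and the sample component are made up for illustration. In a real project, ESLint's `no-use-before-define` rule or the TypeScript compiler would be the better tool.

```python
import re

# Matches declarations of the form: const name = useMemo(
DECL = re.compile(r"\bconst\s+([A-Za-z_$][\w$]*)\s*=\s*useMemo\s*\(")


def find_use_before_memo(source):
    """Return (name, use_line, decl_line) for names read above their useMemo declaration.

    Line numbers are 1-based. Only the first offending use per name is reported.
    """
    lines = source.splitlines()
    problems = []
    for decl_idx, line in enumerate(lines):
        match = DECL.search(line)
        if not match:
            continue
        name = match.group(1)
        # Whole-identifier match, ignoring property accesses such as obj.name
        word = re.compile(r"(?<![\w$.])" + re.escape(name) + r"(?![\w$])")
        for use_idx in range(decl_idx):
            if word.search(lines[use_idx]):
                problems.append((name, use_idx + 1, decl_idx + 1))
                break
    return problems


# Hypothetical component with the bug: `total` is read on line 2, declared on line 3.
BROKEN = """function Total({ items }) {
  const label = `Total: ${total}`;
  const total = useMemo(() => items.reduce((a, b) => a + b, 0), [items]);
  return label;
}"""

print(find_use_before_memo(BROKEN))  # [('total', 2, 3)]
```

Running something like this over each changed file gives a short list of declarations that only need to move a few lines up, which is the narrow fix the commenter wanted in the first place.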

  • One tip I have is that once you have the diff you want to fix, start a new session and have it work on the diff fresh. They’ve improved this, but it’s still the case that the farther you get into the context window, the dumber and less focused the model gets. I learned this from the Claude Code team themselves, who have long advised starting over rather than trying to steer a conversation that has started down a wrong path.

    I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.

    This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.

    • subagents are huge, could execute on a massive plan that should easily fill up a 200k context window and be done at around 60k for the orchestration agent.

      as a cheapass, being able to pass off the simple work to cheaper $ per token agents is also just great. I've got a handful of tasks I can happily delegate work to a haiku agent and anything requiring a bit of reasoning goes to sonnet.

      Feel like opus is almost a cheatcode when i do get stuck, i just bust out a full opus workflow instead and it just destroys everything i was struggling with usually. like playing on easy mode.

      as cool as this stuff is, kinda still wish i was just grandfathered into the plan with no weekly limit and only the 5 hour window limits, id just be happily hammering opus blissfully.

  • Branch first so you can just undo. I think this would have worked with sub agents and /loop maybe? Write all items to change to a todo.md. Have it split up the work with haiku sub agents doing 5-10 changes at a time, marking the todos done, and /loop until all are done. You’ll succeed I suspect. If the main claude instance compacts its context - stop and start from where you left off.

    • It actually did automatically break the work up into chunks and launched a bunch of parallel workers to each handle a smaller amount of work. It wasn't doing everything in a single instance.

      The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.

      But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.

  • If you use eslint and tell it how to run lint in CLAUDE.md it will run lint itself and find and fix most issues like this.

    Definitely not ideal, but sure helps.

  • You’re using it wrong. As soon as it starts going off the rails once you’ve repeated yourself, you drop the whole session and start over.
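The todo.md suggestion in the replies above (write every item to a file, hand out 5-10 at a time, mark them done, repeat) can be sketched in a few lines. This is only the bookkeeping half, assuming GitHub-style `- [ ]` / `- [x]` checkboxes; the function names are made up, and the step that actually hands a batch to a subagent is left out because it depends on the harness.

```python
import re

# A checklist line such as "- [ ] src/A.tsx" or "  * [x] src/B.tsx"
ITEM = re.compile(r"^(\s*[-*] \[)( |x)(\] )(.*)$")


def pending(todo_text):
    """Return (line_index, description) for every unchecked item."""
    found = []
    for index, line in enumerate(todo_text.splitlines()):
        match = ITEM.match(line)
        if match and match.group(2) == " ":
            found.append((index, match.group(4)))
    return found


def batches(items, size=5):
    """Split items into consecutive chunks of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def mark_done(todo_text, line_indexes):
    """Return todo_text with the given checklist lines switched to [x]."""
    lines = todo_text.splitlines()
    for index in set(line_indexes):
        match = ITEM.match(lines[index])
        if match:
            lines[index] = match.group(1) + "x" + match.group(3) + match.group(4)
    return "\n".join(lines)
```

Because the state lives in the file rather than in the conversation, a fresh session can pick up from `pending(...)` after a compaction or a restart. As the reply above points out, this only tracks which items were touched, not whether each change was correct.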

I find that Opus misses a lot of details in the code base when I want it to design a feature or something. It jumps to a basic solution which is actually good but might affect something elsewhere.

GPT 5.4 on codex cli has been much more reliable for me lately. I used to have opus write and codex review, I now do the opposite (I actually have codex write and both review in parallel).

So on the latest models for my use case gpt > opus but these change all the time.

Edit: also the harness is shit. Claude code has been slow, weird and a resource hog. Refuses to read now standardized .agents dirs so I need symlink gymnastics. Hides as much info as it can… Codex cli is working much better lately.

  • Codex CLI is so much more pleasant to use than CC. I cancelled my CC subscription after the OpenCode thing, but somewhat ironically have recently found myself naturally trying the native Codex CLI client first more often lately over OpenCode.

    Kinda funny how you don't actually need to use coercion if you put in the engineering work to build a product that's competitive on its own technical merits...

I am starting to believe it’s not OPUS but developers getting better at using LLMs across the board. And not realizing they are just getting much better at using these tools.

I also thought it was OPUS 4.5 (also tested a lot with 4.6) and then in February switched to only using auto mode in the coding IDEs. They do not use OPUS (most of the time), and I’m ending up with a similar result after a very rough learning curve.

Now switching back to OPUS I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases OPUS is actually in the way after learning to prompt more effectively with cheaper models.

The big difference now is that I’m just paying $60-90 a month for 40-50hrs of weekly usage… while I was inching towards $1000 with OPUS. I chose these auto modes because they don’t dig into usage based pricing or throttling which is a pretty sweet deal.

What kinds of things are you building? This is not my experience at all.

Just today I asked Claude using opus 4.6 to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine but it built a test suite for an existing diff tool. It set everything up in the new directory, but it was actually testing code and logic from a preexisting directory despite the plan being correct before I told it to execute.

I started over and wrote out a few skeleton functions myself, then asked it to write tests for those to test for some new functionality. Then my plan was to ask it to add that functionality using the tests as guardrails.

Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.

After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.

When I managed to fix that, it decided that it needed to rebuild multiple docker components before each test and tear them down after each one.

After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.

We’ve recently been tasked at work with spending more money on Claude (not being more productive; the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org in a very large tech company has managed to do anything very impressive with Claude other than bringing down prod 2 days ago.

Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.

I’ve had much more luck using opus 4.6 in vs studio to make more targeted changes, explain things, debug etc… Claude seems too hard to wrangle and it isn’t good enough for you to be operating that far removed from the code.
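One of the failures described above, tests that re-implement the requested logic instead of calling the functions under test, can be detected cheaply: sabotage the function and see whether the test still passes. A minimal sketch, with everything in it hypothetical (`diff_rows` stands in for whatever the real function is, and the original project was in Go, not Python):

```python
from types import SimpleNamespace


def diff_rows(old, new):
    """Function under test: rows present in `new` but not in `old`."""
    return [row for row in new if row not in old]


def tautological_test(impl):
    # Re-implements the logic inline and never touches impl.diff_rows
    old, new = [1, 2], [2, 3]
    assert [row for row in new if row not in old] == [3]


def real_test(impl):
    # Goes through the function under test
    old, new = [1, 2], [2, 3]
    assert impl.diff_rows(old, new) == [3]


def exercises(test, impl, name):
    """True if `test` fails once impl.<name> is replaced by a function that raises."""
    def sabotaged(*args, **kwargs):
        raise RuntimeError(name + " was called")

    broken = SimpleNamespace(**{**vars(impl), name: sabotaged})
    try:
        test(broken)
    except Exception:
        return True
    return False


impl = SimpleNamespace(diff_rows=diff_rows)
print(exercises(tautological_test, impl, "diff_rows"))  # False
print(exercises(real_test, impl, "diff_rows"))          # True
```

A test that survives the sabotage is not guarding anything. This is a crude, hand-rolled form of mutation testing; it says nothing about whether a test that does call the function asserts anything useful.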

  • You probably just don't have the hang of it yet. It's very good but it's not a mind reader and if you have something specific you want, it's best to just articulate that exactly as best you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to explain that you want tests that assert on observable outcomes and state, not internal structure, use real objects not mocks, property based testing for invariants, etc. It's a feedback loop between yourself and the agent that you must develop a bit before you start seeing "magic" results. A typical session for me looks like:

    - I ask for something highly general and claude explores a bit and responds.

    - We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.

    - It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.

    - After it's done, I skim the broad strokes of the code and point out any code/architectural smells.

    - I ask it to review its own work and then critique that review, etc. We write tests.

    Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.

    • Yes pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file, this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).

  • Curious what language and stack. And have people at your company had marginally more success with greenfield projects like prototypes? I guess that’s what you’re describing, though it sounds like it’s a directory in a monorepo maybe?

    • This was in Go, but my org also uses Typescript, and Elixir.

      I’ve had plenty of success with greenfield projects myself but using the copilot agent and opus 4.5 and 4.6. I completely vibecoded a small game for my 4 year old in 2 hours. It’s probably 20% of the way to being production ready if I wanted to release it, but it works and he loves it.

      And yes people have had success with very simple prototypes and demos at work.

  • Similar experience. I use these AI tools on a daily basis. I have tons of examples like yours. In one recent instance I explicitly told it in the prompt to not use memcpy, and it used memcpy anyway, and generated a 30-line diff after thinking for 20 minutes. In that amount of time I created a 10-line diff that didn't use memcpy.

    I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.

    Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.

> PRD

Is it Baader-Meinhof or is everyone on HN suddenly using obscure acronyms?

  • It stands for Product Requirements Document, it is something commonly used in project planning and management.

I haven't been able to use Opus in super complex projects (GBs of conversation history, or novel algorithms) - Codex reigns supreme here. However, for simpler apps, I wouldn't trade the speed and polish of Opus.

I had been able to get it into the classic AI loop once.

It was about a problem with calculation around filling a topographical water basin with sedimentation where calculation is discrete (i.e. turn-based) and that edge case where both water and sediments would overflow the basin. To make the matter simple, fact was A, B, C, and it oscillated between explanation 1 which refuted C, explanation 2 which refuted A and explanation 3 that refuted B.

I'll give it to opus training stability that my 3 tries using it all consistently got into this loop, so I decided to directly order it to do a brute force solution that avoided (but didn't solve) this problem.

I did feel like with a human, there's no way that those 3 loops would happen by the second time. Or at least the majority of us. But there is just no way to get through to opus 4.6

Opus-4.6 is so far ahead of the rest that I think Anthropic is the winner in winner-take-all

  • Codex doesn't seem that far behind. I use the top model available for api key use and it's gotten faster this month even on the max effort level (not like a cheetah - more like not so damn painful anymore). Plus, it also forks agents in parallel - for speed & to avoid polluting the main context. I.e. it will fork explorer agents while investigating (kind of amusing because they're named after famous scientists).

> [...] with multiple agents working at the same time, each at that speed.

Horizontal parallelising of tasks doesn't really require any modern tech.

But I agree that Opus 4.6 with 1M context window is really good at lots of routine programming tasks.

  • Opus helped me brick my RPi CM4 today. It glibly apologized for telling me to use an e instead of a 6 in a boot loader sequence.

    Spent an hour or so unraveling the mess. My feelings are growing more and more conflicted about these tools. They are here to stay obviously.

    I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience. It’s like the future is a world where the entire programming sphere is dominated by the clueless non technical management that we’ve all had to deal with in small proportion a time or two.

    • > I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience.

      Well, (economic) progress means being able to do more with less. A Fordian-style conveyor belt factory can churn out cars with relatively unskilled labour.

      Economising on human capital is economising on a scarce input.

      We had these kinds of shifts before. Compare also how planes used to have a pilot, copilot and flight engineer. We don't have that anymore, but it used to be a place for people to learn. But pilot education has adapted.

      Or check how spreadsheet software has removed a lot of the worst rote work in finance. That change happened perhaps in the 1980s. Finance has adapted.

      > Opus helped me brick my RPi CM4 today. It glibly apologized for telling me to use an e instead of a 6 in a boot loader sequence.

      Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.

It's so far the best model that answers my questions about Wolfram language.

That being said it's the only use case for me. I won't subscribe to something that I can't use with third party harness.

Opus 4.6 is AGI in my book. They won’t admit it, but it’s absolutely true. It shows initiative in not only getting things right but also adding improvements that the original prompt didn't request that match the goals of the job.

  • I don’t know if Opus is AGI but on a broader note, that’s how we will get AGI. Not some consciousness like people are expecting. It’s just going to be chatbot that’s very hard to stump and starts making actual scientific breakthroughs and solving long standing problems.

    • I'll be more likely to agree with anything being AGI if it doesn't have such obvious and common brittleness. These LLMs all go off the rails when the context window gets large. Their context is also easy to "poison", and so it's better to rollback conversations that went bad rather than trying to steer them back to the light.

      There are probably more examples, but to me AGI must move beyond the above issues. Though frankly context window might just be a symptom of poor harness than anything, still - it illustrates my general issue with them being considered AGI as it stands today.

      Claude 4.6 is getting crazy good though, i'll give you that.

I’ll put out a suggestion you pair with codex or deepthink for audit and review - opus is still prone to … enthusiastic architectural decisions. I promise you will be at least thankful and at most like ‘wtf?’ at some audit outputs.

Also shout out to beads - I highly recommend you pair it with beads from yegge: opus can lay out a large project with beads, and keep track of what to do next and churn through the list beautifully with a little help.

  • I've been pairing it with Codex using https://github.com/pjlsergeant/moarcode

    The amount of genuine fuck-ups Codex finds makes me skeptical of people who are placing a lot of trust in Claude alone.

    • Nice. Yeah I have them connect through beads, which combined with a git log is a lot of information - it feels smoother to me than this looks. But I agree with the sentiment. Codex isn't my favorite for understanding and implementing. But I appreciate the intelligence and pickiness very much.

Just yesterday I asked it to repeat a very simple task 10 times. It ended up doing it 15 times. It wasn't a problem per se, just a bit jarring that it was unable to follow such simple instructions (it even repeated my desire for 10 repetitions at the start!).

I had Opus 4.6 running on a backend bug for hours. It got nowhere. Turned out the problem was in AWS X-Ray swizzling the fetch method and not handling the same argument types as the original, which led to cryptic errors.

I had Opus 4.6 tell me I was "seeing things wrong" when I tried to have it correct some graphical issues. It got stuck in a loop of re-introducing the same bug every hour or so in an attempt to fix the issue.

I'm not disagreeing with your experience, but in my experience it is largely the same as what I had with Opus 4.5 / Codex / etc.
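The fetch bug mentioned above belongs to a general class: a wrapper replaces a function but accepts fewer argument shapes than the original. The sketch below illustrates that class only. It is not X-Ray's code, and `Request`, `fetch`, `narrow_patch` and `transparent_patch` are all invented stand-ins.

```python
class Request:
    """Stand-in for a request object that carries its own URL."""

    def __init__(self, url):
        self.url = url


def fetch(resource):
    """Original function: accepts either a URL string or a Request."""
    url = resource.url if isinstance(resource, Request) else resource
    return "GET " + url


def narrow_patch(original):
    """Buggy wrapper: assumes the argument is always a string."""
    def patched(resource):
        host = resource.split("/")[2]  # AttributeError on Request objects
        return original(resource)
    return patched


def transparent_patch(original, log):
    """Safer wrapper: records the call and forwards the arguments untouched."""
    def patched(*args, **kwargs):
        log.append(original.__name__)
        return original(*args, **kwargs)
    return patched
```

With `narrow_patch`, string URLs keep working and only callers that pass a `Request` fail, with an error that points into the wrapper rather than at the caller. That is probably why this kind of bug reads as cryptic from the application side.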

  • Haha, reminds me of an unbelievably aggravating exchange with Codex (GPT 5.4 / High) where it was unflinchingly gaslighting me about undesired behavior still occurring after a change it made that it was adamant simply could not be happening.

    It started by insisting I was repeatedly making a typo and still would not budge even after I started copy/pasting the full terminal history of what I was entering and the unedited output, and eventually pivoted to darkly insinuating I was tampering with my shell environment as if I was trying to mislead it or something.

    Ultimately it turned out that it forgot it was supposed to be applying the fixes to the actual server instead of the local dev environment, and had earlier in the conversation switched from editing directly over SSH to pushing/pulling the local repo to the remote due to diffs getting mangled.

The replies to this really make me think that some people are getting left behind in the AI age. Colleges are likely already teaching how to prompt, but a lot of existing software devs just don't get it. I encourage people who aren't having success with AI to watch some youtube videos on best practices.