Comment by strix_varius

19 hours ago

Not OP but I've been thinking about this a lot (like everyone ha) and I think my answer is, yes?

I hope there's a "good enough" point but I don't think we're there yet. Like for me hardware got good enough several years ago. But while Opus 4.7 is really good compared to everything else, it's not so good that I would use it at a discount over whatever is available in a few months. The improvement in quality, speed, and daily frustration is worth it to me... Spoken as someone whose employer is footing the bill, so take that with a grain of salt.

I want to run my own local models, but I don't think that's feasible without lots of frustration until a few generations of frontier models are so good that they're almost indistinguishable for common tasks. Kind of like how MacBook Pros have been for a while.

Why should I need to talk to Opus 4.7 when my day-to-day task is programming in Java and Python? I don't need my model to know about biology or chemistry. If I need those capabilities (say, as a software engineer working in the chemical industry), I will talk to Opus 4.7 for planning and then fan out the work to cheaper coding models. I think we will soon start to see specialized, highly effective, English-only programming models. I don't need my coding model to know about literature, art, philosophy, ethics, etc.

  • I would think that the surrounding chemical "knowledge" could be useful in the context of programming in that industry. Have you ever found it to draw links and conclusions between what you're doing in computer science and the chemistry it's in the middle of?

While I can imagine that I'd want to use Opus 4.8 over 4.6 for a fair number of things (at least if they can avoid further speed regressions), I also have noticed that certain types of failures seem to be systemic. Bigger context has been helpful for bootstrapping, but still doesn't fix problems of getting stuck on the wrong things - you can toss more things in the blender, but you don't necessarily know which way it'll slice them up in advance, or which things from them it'll latch onto. And output still seems to get into "blindered" states where important details get dropped - even though it'll agree very quickly when you point that out. As long as we're in that sort of "spit something out in a local, targeted manner, and then do a revision loop until tests are green" style of execution, bigger models haven't shown me the ability to really avoid non-optimal / subtly-broken outputs for complex problems.

Using Cursor to hop between models, I've found Opus to be generally better at really tricky debugging than GPT 5.5 or earlier models, but not reliably better at execution because of these things. I'm not sure Composer 2.5 is quite there yet for the execution side, but it's getting pretty close to those other ones, such that I'm definitely still in a "debug and plan with slow, execute with faster ones" operating model for working on hard shit.