Comment by subhobroto

3 hours ago

> Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Your skepticism is well-founded IMHO. I have found that if you are one-shotting a Django/Next CRUD app, a typical React/Vue UI, shell scripts or GitHub Actions, Composer 2.5 is fantastic!

But for anything outside the median of the last decade's web development - like free-body physics, kinematics, or optimization - Composer is horribly unpredictable.

That's what makes it _dangerous_ IMHO.

It isn't universally trash! Rather, it confidently makes subtle, incorrect assumptions. It will hallucinate formulas that don't appear in your specification or design docs, then write tests that pass against those same formulas.
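To make that failure mode concrete, here is a minimal, hypothetical sketch (my own illustration, not actual Composer output). All function names are made up. It assumes a simple constant-deceleration stopping-distance problem, where the correct result follows from v^2 = v0^2 - 2*a*d with v = 0. The "wrong" version drops the factor of 2, and the self-confirming test encodes the same mistake, so it passes anyway.

```python
import math


def stopping_distance_wrong(v0: float, decel: float) -> float:
    """Hypothetical hallucinated formula: the factor of 2 is missing."""
    return v0 ** 2 / decel


def stopping_distance(v0: float, decel: float) -> float:
    """Constant-deceleration kinematics: d = v0^2 / (2 * a)."""
    return v0 ** 2 / (2 * decel)


def self_confirming_test(fn) -> bool:
    """A test whose expected value repeats the same wrong assumption.

    It checks the code against the bad formula rather than against the spec,
    so the wrong implementation passes.
    """
    expected = 20.0 ** 2 / 5.0  # same missing factor of 2
    return math.isclose(fn(20.0, 5.0), expected)


def independent_test(fn) -> bool:
    """A test whose expected value is derived by hand from the spec.

    From 20 m/s at 5 m/s^2: d = 400 / (2 * 5) = 40 m.
    """
    return math.isclose(fn(20.0, 5.0), 40.0)


if __name__ == "__main__":
    print("wrong impl, self-confirming test:", self_confirming_test(stopping_distance_wrong))
    print("wrong impl, independent test:", independent_test(stopping_distance_wrong))
    print("correct impl, independent test:", independent_test(stopping_distance))
```

The point of the sketch is that a green test run tells you nothing when the test and the code share the same invented formula; only an expected value derived independently from the spec catches it.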

It inserts tiny footguns that require you to scrutinize every single token it generates. At that point, I would rather be coding by hand.

Opus 4.8 max, on the other hand, refuses to guess, at least the way I have set it up. If there's any ambiguity about the implementation or how tests should be written, it stops and asks me for clarification. I actually trust the output without worrying about hidden disasters and ticking time bombs. I can confidently review the test suite, add a few edge cases of my own, spot-check the code, and be comfortable knowing there are no disastrous footguns lurking in the shadows, waiting to come out in the darkness of production deployments.

Let me repeat - Opus 4.8 max stops and asks me for clarification. It writes the tests I would have written. It writes tests that fail, exposing gaps and errors, which then lets me iterate.

Composer 2.5 OTOH will run with whatever it decides I meant and write something that steals productivity rather than adding to it.

Same harness (Cursor), same rules, same prompts, vastly different outcomes!

Yes, Opus is far more expensive, but it's worth it for the time saved on review and refactors, which are our current blockers.

The real friction is that Cursor's marketing is so aggressive that the people paying the bills look at my Opus usage and demand to know why I'm not using the cheaper alternative!

It's an impossible argument to win when the rest of the company's devs are happily building standard web apps on Composer without issue, blissfully unaware of how unreliable the model becomes on harder engineering problems.

Fable 5 is in a league of its own. If history in the LLM space is any predictor of the future, in ~6 months (Q1 2027) we should have open-weight models that are competitive with Fable 5. Setting aside what it will take to run such a thing, I would be extremely excited to have open access to that kind of capability. Great times ahead!