Comment by mdasen

13 hours ago

I'm a bit skeptical.

Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. Look at the DeepSWE benchmark (which is probably the hardest to game at this point): GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.

I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third party ones - I'd love for a cheap model to do as well as the expensive ones.

(I work at Cursor) When Composer 2.5 launched, we initially scored very competitively on AA's composite benchmark. I believe 3rd place overall. They have recently updated to use DeepSWE, which has more of a focus on very long-horizon tasks, and Composer isn't as good at those yet. We're aware and working on this for our next model.

Overall, some benchmarks show Composer doing well, others not so much. We think the model is very capable at the given price point. There's lots to improve! If you see any specific behaviors or places where the model isn't very good, let me know here or email me at lrobinson at cursor.com.

  • > We think the model is very capable at the given price point.

    The "price point" comparison is a lie though because Composer is only available with a monthly Cursor subscription, and Cursor's external-model-per-token charges for other models are not representative of what other models' monthly subscribers get. An OpenAI $200 subscription gets you at least as much GPT 5.5 as a $200 Cursor subscription gets you Composer 2.5.

  • How does it compare to a $100 Claude subscription at $60? Especially in terms of how much of it I can use, because I haven't found anything in the US that gets me similar usage to Claude at $100 per month or less. I'm really open to alternatives.

    Grok build only gave me roughly 10 hours of use for $40 for the entire month...

    I don't even care about long horizon, can I use it a reasonable amount of time through the month? I use AI for hobby projects, Claude gets me quite far, but I tire of dropping $100 every month. I'm not sending my money to some Chinese firm that now has access to my computer.

  • Even with the new benchmark, Composer 2.5 seems to be just a bit worse than Opus 4.7. So I assume it's going to be roughly on par with Sonnet 5.0 at 1/6 of the cost.

Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

Ironically, their benchmark might be more accurate than artificial analysis for a narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.

  • (I work at Cursor) CursorBench includes many evals from actual engineering tasks from the Cursor team, which include our private codebase. This codebase is held-out from training so models haven't seen it, including Composer.

I can't speak to benchmarks, but I have used Composer 2.5 extensively and it's performed quite well in my real world tasks.

DeepSWE is slightly flawed in the sense that it uses only its own harness, and that causes issues for models that are not correctly supported by it. There's a huge amount of evidence that the harness plays a big role in how these models work, and yet DeepSWE entirely removes that (and has probably only tested that it works fine with some favourite model of theirs).

There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.

None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.

  • I think DeepSWE is flawed in a different way: the tasks look like someone took a bunch of big highly technical PRs they found really well done, and inverted it into specs for agents to autistically execute. This is not really how people use agents in practice IMO. And it's why DeepSWE is so generous to OAI models, rigid task execution is the thing they're best at. I think FrontierCode matches the vibes a lot better.

Cursor sessions are pretty much what composer models are RL'd on. This bench and the training data are/should be basically the same distribution.

Anecdotally, I find Composer 2.5 to be useless. I do use light LLMs like Claude Haiku and some of Cursor's older free models, but Composer is negative productivity for me.

  • The opposite for me: I use Composer for everything, like triggering and monitoring a 10-step release process. It's a very capable model.

    • This is my finding too. I have moved to it fully for most of my planning and coding.

      For most tasks it is capable and very cheap. A day's worth of tasks costs about $10.

> Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Your skepticism is well-founded IMHO. I have found that if you are one-shotting a Django/Next CRUD app, a typical React/Vue UI, shell scripts or GitHub Actions, Composer 2.5 is fantastic!

But for anything outside the median of the last decade's web development - like free-body physics, kinematics, or optimization - Composer is horribly unpredictable.

That's what makes it _dangerous_ IMHO.

It isn't universally trash! Rather, it confidently makes subtle, incorrect assumptions. It will hallucinate formulas that don't appear in your specification and design docs. Then write tests that pass it.
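A minimal sketch of that failure mode, using a made-up kinematics example rather than anything from my actual project. The function names, the 10 m/s launch speed, and the 45 degree angle are all hypothetical. The point is only that a test which derives its expected value from the code under test will pass no matter what formula the model invented, while a test with a hand-derived expected value from the spec will not.

```python
import math

G = 9.81  # m/s^2, assumed constant for this sketch


def range_from_spec(speed: float, angle_deg: float) -> float:
    """Flat-ground, no-drag projectile range: R = v^2 * sin(2*theta) / g."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * theta) / G


def range_hallucinated(speed: float, angle_deg: float) -> float:
    """Plausible-looking but wrong: uses sin(theta) instead of sin(2*theta)."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(theta) / G


def tautological_test(fn) -> bool:
    # Expected value comes from the function under test, so this passes
    # for any formula at all.
    expected = fn(10.0, 45.0)
    return math.isclose(fn(10.0, 45.0), expected)


def independent_test(fn) -> bool:
    # Expected value worked out by hand from the spec: at 45 degrees,
    # sin(2*theta) = 1, so R = v^2 / g.
    expected = 10.0 ** 2 / G
    return math.isclose(fn(10.0, 45.0), expected, rel_tol=1e-9)


if __name__ == "__main__":
    for fn in (range_from_spec, range_hallucinated):
        print(fn.__name__, tautological_test(fn), independent_test(fn))
```

Both implementations pass the tautological test; only the spec-derived one passes the independent test, which is the kind of check I end up having to write by hand.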

It inserts tiny footguns that require you to scrutinize every single token it generates. At that point, I would rather be coding by hand.

Opus 4.8 max, on the other hand, refuses to guess, at least the way I have set it up. If there's any ambiguity about the implementation or how tests should be written, it stops and asks me for clarification. I actually trust the output without worrying about hidden disasters and ticking timebombs. I can confidently review the test suite, add a few edge cases on my own, spot check the code and be comfortable knowing there are no disastrous footguns lurking in the shadows only to come out in the darkness of production deployments.

Let me repeat - Opus 4.8 max stops and asks me for clarification. It writes the tests I would have written. It writes tests that fail, which then lets me iterate.

Composer 2.5 OTOH will run with whatever it decides I meant and write something that steals productivity rather than adding to it.

Same harness (Cursor), same rules, same prompts, vastly different outcomes!

Yes, Opus is far more expensive, but it's worth it for the time saved on review, which is our current blocker.

The real friction is that Cursor's marketing is so aggressive that the people paying the bills look at my Opus usage and demand to know why I'm not using the cheaper alternative!

It's an impossible argument to win when the rest of the company's devs are happily building standard web apps on Composer without issue, blissfully unaware of how the model not only falls apart but is just unreliable on harder engineering problems.

Fable 5 is in a league of its own. If history in the LLM space is a predictor of the future, in ~6 months we should have open weight models that are competitive with Fable 5. Without considering what it will take to run such a thing, I would be extremely excited to have open access to such a capability. Great times ahead!

For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, composer 2.5 is honestly pretty great. The results get notably worse for larger tasks though.

  • Agreed. It’s worse than Opus of course. But Opus takes more than 10x longer to give you something to look at. I’m not kidding, I “benchmarked” a real ticket I was working on. Opus 4.7 took more than 30min. Opus 4.8 took over an hour. Composer 2.5 took 5min on the exact same prompt & local setup. My subjective review is that composer’s code was only like 10-20% worse. It still worked, it was just a bit less clean and a little more hacky. But it’s not like Opus is flawless either. At the end of the day, if it takes an hour to get to draft code I can look at and iterate on… that’s fucking impossible for me. Unless it did an excellent job. But as long as I still need to review and follow up with changes, Opus is just too slow. It’s really frustrating because it’s a lot slower than it was 6mo ago, and not noticeably better. Fable seems a step in the right direction but is $$$$

That benchmark seems to match my experience. GPT 5.5 is significantly better than Opus 4.8, Composer 2.5 was truly dumb the last time I tried it, and Fable looks to me to be on par with GPT 5.5 but different overall. The best approach is to have an LLM peer review between GPT and Opus (now Fable).

Composer writes the worst, stupidest, most naive and straight up brain-dead code you could imagine. Fast and cheap is about all it’s got going for it. I mostly use it for “sort these lines alphabetically” and stuff that’s a smidge too complex for regex find/replace.

  • It’s starting to feel like people need to say what language/stack and problem space they’re working in. It would be interesting to see why we’re seeing such wild variance.

  • I primarily use Composer. I wanted to build something from scratch recently and, thinking I was missing out on something, I got Opus to build it. I wasn't blown away. I gave the same prompts to Composer and the code it came up with was different but similar in quality. I ended up going with the Composer code because its faster response time made it easier to iterate on improvements.

I mean, they train their model on their training data. So it should score well on their own benchmark.