Comment by mcv

2 months ago

Opus 4.5 ate through my Copilot quota last month, and it's already halfway through it for this month. I've used it a lot, for really complex code.

And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do to do something wrong, or even repeat a previous mistake I had to correct.

Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.

So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.

Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.

45 comments

mcv

hawtads 2 months ago

Copilot and many coding agents truncates the context window and uses dynamic summarization to keep costs low for them. That's how they are able to provide flat fee plans.

You can see some of the context limits here:

https://models.dev/

If you want the full capability, use the API and use something like opencode. You will find that a single PR can easily rack up 3 digits of consumption costs.

verdverm 2 months ago

Gerring off of their plans and prompts is so worth it, I know from experience, I'm paying less and getting more so far, paying by token, heavy gemini-3-flash user, it's a really good model, this is the future (distillations into fast, good enough for 90% of tasks), not mega models like Claude. Those will still be created for distillations and the harder problems
mcv 2 months ago
Maybe not, then. I'm afraid I have no idea what those numbers mean, but it looks like Gemini and ChatGPT 4 can handle a much larger context than Opus, and Opus 4.5 is cheaper than older versions. Is that correct? Because I could be misinterpreting that table.
- esperent 2 months ago
  
  I don't know about GPT4 but the latest one (GPT 5.2) has 200k context window while Gemini has 1m, five times higher. You'll be wanting to stay within the first 100k on all of them to avoid hitting quotas very quickly though (either start a new task or compact when you reach that) so in practice there's no difference.
  I've been cycling between a couple of $20 accounts to avoid running out of quota and the latest of all of them are great. I'd give GPT 5.2 codex the slight edge but not by a lot.
  The latest Claude is about the same too but the limits on the $20 plan are too low for me to bother with.
  The last week has made me realize how close these are to being commodities already. Even the CLI the agents are nearly the same bar some minor quirks (although I've hit more bugs in Gemini CLI but each time I can just save a checkpoint and restart).
  The real differentiating factor right now is quota and cost.
  
  4 replies →
- cma 2 months ago
  
  You need to find where context breaks down, Claude was better at it even when Gemini had 5X more on paper, but both have improved with last releases.

deanc 2 months ago

People are completely missing the points about agentic development. The model is obviously a huge factor in the quality of the output, but the real magic lies in how the tools are managing and injecting context in to them, as well as the tooling. I switched from Copilot to Cursor at the end of 2025, and it was absolute night and day in terms of how the agents behaved.

port3000 2 months ago
Interesting you have this opinion yet you're using Cursor instead of Claude Code. By the same logic, you should get even better results directly using Anthropic's wrapper for their own model.
- deanc 2 months ago
  
  My employer doesn't allow for Claude Code yet. I'm fully aware from speaking to other peers, that they are getting even better performance out of Claude Code.
causal 2 months ago

In my experience GPT-5 is also much more effective in the Cursor context than the Codex context. Cursor deserves props for doing something right under the hood.

zmmmmm 2 months ago

yes just using AI for code analysis is way under appreciated I think. Even the most sceptical people on using it for coding should try it out as a tool for Q&A style code interrogation as well as generating documentation. I would say it zero-shots documentation generation better than most human efforts would to the point it begs the question of whether it's worth having the documentation in the first place. Obviously it can make mistakes but I would say they are below the threshold of human mistakes from what I've seen.

sfink 2 months ago
(I haven't used AI much, so feel free to ignore me.)
This is one thing I've tried using it for, and I've found this to be very, very tricky. At first glance, it seems unbelievably good. The comments read well, they seem correct, and they even include some very non-obvious information.
But almost every time I sit down and really think about a comment that includes any of that more complex analysis, I end up discarding it. Often, it's right but it's missing the point, in a way that will lead a reader astray. It's subtle and I really ought to dig up an example, but I'm unable to find the session I'm thinking about.
This was with ChatGPT 5, fwiw. It's totally possible that other models do better. (Or even newer ChatGPT; this was very early on in 5.)
Code review is similar. It comes up with clever chains of reasoning for why something is problematic, and initially convinces me. But when I dig into it, the review comment ends up not applying.
It could also be the specific codebase I'm using this on? (It's the SpiderMonkey source.)
- zmmmmm 2 months ago
  
  My main experience is with anthropic models.
  I've had some encounters with inaccuracies but my general experience has been amazing. I've cloned completely foreign git repos, cranked up the tool and just said "I'm having this bug, give me an overview of how X and Y work" and it will create great high level conceptual outlines that mean I can drive straight in where without it I would spend a long time just flailing around.
  I do think an essential skill is developing just the right level of scepticism. It's not really different to working with a human though. If a human tells me X or Y works in a certain way i always allow a small margin of possibility they are wrong.
  
  1 reply →
- mcv 2 months ago
  
  They do have a knack for missing the point. Even Opus 4.5 can laser focus on the wrong thing. It does take skill and experience to interpret them correctly and set them straight when they go wrong.
  Even so, for understanding what happens in a big chunk of code, they're pretty great.

josu 2 months ago

>So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.

I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.

- https://highload.fun/tasks/3/leaderboard

- https://highload.fun/tasks/12/leaderboard

- https://highload.fun/tasks/15/leaderboard

- https://highload.fun/tasks/18/leaderboard

- https://highload.fun/tasks/24/leaderboard

johndough 2 months ago
All the naysayer here have clearly no idea. Your large matrix multiplication implementation is quite impressive! I have set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it has implemented:
- padding from 2000 to 2048 for easier power-of-two splitting - two-level Winograd matrix multiplication with tiled matmul for last level - unrolled AVX2 kernel for 64x64 submatrices - 64 byte aligned memory - restrict keyword for pointers - better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
But yours is still easily 25 % faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?
- josu 2 months ago
  
  Thank you. Yeah, I'm doing all those things, which do get you close to the top. The rest of things I'm doing are mostly micro-optimizations such as finding a way to avoid AVX→SSE transition penalty (1-2% improvement).
  But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.
  For example, chatGPT was able to find Matt's blog post regarding Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...
  Interestingly, it seems that Matt's post is not on the training data of any of the major LLMs.
zarzavat 2 months ago
How are you qualified to judge its performance on real code if you don't know how to write a hello world?
Yes, LLMs are very good at writing code, they are so good at writing code that they often generate reams of unmaintainable spaghetti.
When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.
Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.
- josu 2 months ago
  
  I know what's like running a business, and building complex systems. That's not the point.
  I used highload as an example because it seems like an objective rebuttal to the claim that "but it can't tackle those complex problems by itself."
  And regarding this:
  "Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash"
  Again, a combination of LLM/agents with some guidance (from someone with no prior experience in this type of high performing architecture) was able to beat all human software developers that have taken these challenges.
- VMG 2 months ago
  
  > Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.
  The skill of "a human software developer" is in fact a very wide distribution, and your statement is true for a ever shrinking tail end of that
- FeepingCreature 2 months ago
  
  > How are you qualified to judge its performance on real code if you don't know how to write a hello world?
  The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.
  
  5 replies →
throw1235435 2 months ago

If that is true; then all the commentary around software people having jobs still due to "taste" and other nice words is just that. Commentary. In the end the higher level stuff still needs someone to learn it (e.g. learning ASX2 architecture, knowing what tech to work with); but it requires IMO significantly less practice then coding which in itself was a gate. The skill morphs more into a tech expert rather than a coding expert.
I'm not sure what this means for the future of SWE's though yet. I don't see higher levels of staff in big large businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents, and processes (got better things to do at higher levels). But I do see the barrier of learning to code gone; meaning it probably becomes just like any other job.
ModernMech 2 months ago
None of the problems you've shown there are anything close to "very complex computer engineering problems", they're more like "toy problems with widely-known solutions given to students to help them practice for when they encounter actually complex problems".
- josu 1 month ago
  
  I think you misunderstood, it's not about solving the problem, is about finding the most efficient solution. Give it a shot, and see if you can get to the top 10 on any task.
  
  1 reply →
dajoh 2 months ago
>I'm currently #1 on 5 different very complex computer engineering problems
Ah yes, well known very complex computer engineering problems such as:
* Parsing JSON objects, summing a single field
* Matrix multiplication
* Parsing and evaluating integer basic arithmetic expressions
And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM?
- josu 2 months ago
  
  Lol, the problem is not finding a solution, the problem is solving it in the most efficient way.
  If you think you can beat an LLM, the leaderboard is right there.

yieldcrv 2 months ago

It acts differently when using it through a third party tool

Try it again using Claude Code and a subscription to Claude. It can run as a chat window in VS Code and Cursor too.

mcv 2 months ago
My employer gets me a Copilot subscription with access to Claude, not a subscription to Claude Code, unfortunately.
- yieldcrv 2 months ago
  
  at this point I would suggest getting a $20 subscription to start, seeing if you can expense it
  the tooling is almost as important as the model
  
  2 replies →

NSPG911 2 months ago

> Opus 4.5 ate theough my Copilot quota last month

Sure, Copilot charges 3x tokens for using Opus 4.5, but, how were you still able to use up half the allocated tokens not even one week into January?

I thought using up 50% was mad for me (inline completions + opencode), that's even worse

mcv 2 months ago

I have no idea. Careless use, I guess. I was fixing a bunch of mocks in some once-great but now poorly maintained code, and I wasn't really feeling it so I just fed everything to Claude. Opus, unfortunately. I could easily have downgraded a bit.

Davidzheng 2 months ago

If it can consistently verify that the error persists after fix--you can run (ok maybe you can't budget wise but theoretically) 10000 parallel instances of fixer agents then verify afterwards (this is in line with how the imo/ioi models work according to rumors)