Comment by fnands
20 hours ago
When was the last time you tried?
I think using agents for larger tasks was always very hit or miss, up until about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
I’m not OP but every time I post a comment with this sentiment I get told “the latest models are what you need”. If every 3 months you are saying “it’s ready as long as you use the latest model”, then it wasn’t ready 3 months ago and it’s not likely to be ready now.
To answer your question, I’ve tried both Claude Code and Antigravity in the last 2 weeks and I’m still finding that they struggle. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still regularly goes on wild tangents without actually solving the problem.
I don’t think that’s true. Claude Opus 4.5/4.6 in Cursor have marked the big shift for me. Before that, agentic development mostly made me want to just do it myself, because it was getting stuck or going on tangents.
I think it can (and is) shifting very rapidly. Everyone is different, and I’m sure models are better at different types of work (or styles of working), but it doesn’t take much to make it too frustrating to use. Which also means it doesn’t take much to make it super useful.
> I don’t think that’s true. Claude Opus 4.5/4.6 in Cursor.
Opus 4.6 has been out for less than a month. If it were a big shift, surely we'd see a massive difference over 4.5, which came out in November. I think this proves the point: you're not seeing seismic shifts every 3 months, and you're not even clear about which model was the fix.
> I think it can (and is) shifting very rapidly.
Shifting, maybe. But shuffling deck chairs every 3 months.
It depends on what you're handling. Frontend (not CSS), Swagger, and mundane CRUD are where it shines. Anything more complex that needs harder calculation usually makes the agents struggle.
They're especially good for navigating code you're unfamiliar with. If you already know the code well, you'll usually find it faster to debug and write the code yourself.
Opus 4.6 with the Claude Code VS Code extension
Have you tried it with something like OpenSpec? Strangely, taking the time to lay out the steps in a large task helps immensely. It's the difference between the behavior you describe and just letting it run productively for segments of ten or fifteen minutes.
> Have you tried it with something like OpenSpec?
No. The parent comment said I needed a new model, which I've tried. Being told "just try something else as well" kind of proves the point.
I thought this too and then I discovered plan mode. If you just prompt agent mode it will be terrible, but coming up with a plan first has really made a big difference and I rarely write code at all now
My workflow has become very plan-intensive... including planning of verification+test steps at the end.
Agreed, it’s strange; I’ll just assume that the people who say this are building React apps. I still get so much “certainly, I should not do this in a completely insane way, let me fix that” … -400+2. It’s not always, and it is better than it was, but that’s it.
I'm an ML engineer, so it's mostly been setting up data processing/training code in PyTorch, if that helps.
At this point though, after Claude C Compiler, you've got to give us more details to better understand the dichotomy. What do you consider simple issues?
> At this point though, after Claude C Compiler,
Perfect example. You mean the C compiler that literally failed to compile a hello world [0] (which was given in its own README)?
> What do you consider simple issues?
Hallucinating APIs for well documented libraries/interfaces, ignoring explicit instructions for how to do things, and making very simple logic errors in 30-100 line scripts.
As an example, I asked Claude code to help me with a Roblox game last weekend, and specifically asked it to "create a shop GUI for <X> which scales with the UI, and opens when you press E next to the character". It proceeded to create a GUI with absolute sizings, get stuck on an API hallucination for handling input, and also, when I got it unstuck, it didn't actually work.
[0] https://github.com/anthropics/claudes-c-compiler/issues/1
Claude C compiler is 100k LOC that doesn’t do anything useful, and cost $20k plus the cost of an expert engineer creating a custom harness and babysitting it.
But the most important thing is that they were reverse engineering gcc by using it as an oracle. And it had gcc and thousands of other C compilers in its training set.
So if you are a large corporation looking to copy GPL code so that you can use it without worrying about the license, and the project you want to copy is a text transformer with a rigorously defined set of inputs and outputs, have at it.
> When was the last time you tried?
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases whose patterns I understand really well. The use cases I've tried so far do sort of work, just not (yet, at least) faster than I'm able to actually write the code myself.
> My experience with what coding assistants are good for shifted from:
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
I agree that "full engineering" was a bit broad. I should probably have said something like "agent-only coding"?
I.e. the point where the agent writes all the code and you just verify.
The "you just verify" part can take indeed a lot of steering and hand-holding to get the right implementation for the current company/department/project context. Otherwise you might be just generating tech debt at scale.