Comment by darkxanthos
5 days ago
I stumbled into Agentic Coding in VS Code Nightlies with Copilot using Claude Sonnet 4 and I've been silly productive. Even when half my day is meetings, you wouldn't be able to tell from my git history.
My thinking now is removed from the gory details and is a step or two up. How can I validate the changes are working? Can I understand this code? How should it be structured so I can better understand it? Is there more we can add to the AI conventions markdown in the repo to guide the Agent to make fewer mistaken assumptions?
Last night I had a file with 38 mypy errors. I turned it over to the agent and went and had a conversation with my wife for 15 minutes. I came back, it summarized the changes it made and why, I debated one of the changes with it but ultimately decided it was right.
Mypy passed. Good to go.
I'm currently trying to get my team to really understand the power here. There are a lot of skeptics, and the AI still isn't perfect; people who are against the AI era will latch onto that as validation, but that's exactly the opposite of the correct reaction. If anything it's validation of where this is going, because as a friend of mine says:
"Today is the worst day you will have with this technology for the rest of your life."
> Last night I had a file with 38 mypy errors
Fixing type checker errors should be one of the least time-consuming things you do. Was this previously consuming a lot of your time?
A lot of the AI discourse would be more effective if we could all see the actual work one another is doing with it (similar to the Cloudflare post).
> AI discourse would be more effective if we could all see the actual work one another is doing with it
Yes, this is a frequent problem both here and everywhere else. The discussions need to include things like exact model version, inference parameters, what system prompt you used, what user prompt, what code you gave it, what exactly it replied and so much more details, as currently almost every comment is "Well, I used Sonnet last week and it worked great" without any details. Not to mention discussions around local models missing basic stuff like what quantization (if any) and what hardware you're running it on. People just write out "Wow fast model" or stuff like that, and call it a day.
Although I understand why: every comment would be huge if everyone always added sufficient context. I don't know the solution to this, but it does frustrate me.
There are many examples of exactly what you're asking for, such as Kenton Varda's Cloudflare OAuth provider [1] and Simon Willison's tools [2]. I see new blog posts like this, with detailed explanations of what the author did, pretty frequently; Steve Klabnik's recent post [3] isn't as detailed but has a lot of very concrete facts. There are even more posts from prominent devs like antirez about other things they're doing with AI, like rubber-ducking [4], if you're curious how some of the people who say "I used Sonnet last week and it was great" are actually working, because not everyone uses it to write code. I personally don't, because I care a lot about code style.
[1]: https://github.com/cloudflare/workers-oauth-provider/
[2]: https://tools.simonwillison.net/
[3]: https://steveklabnik.com/writing/a-tale-of-two-claudes/
[4]: https://antirez.com/news/153
> The discussions need to include things like exact model version, inference parameters, what system prompt you used, what user prompt, what code you gave it, what exactly it replied and so much more details, as currently almost every comment is "Well, I used Sonnet last week and it worked great" without any details...Not to mention discussions around local models missing basic stuff like what quantization (if any) and what hardware you're running it on.
While I agree with "more details", the amount of details you're asking for is ... ridiculous. This is a HN comment, not a detailed study.
I feel like that would get tiresome to write, read, and sort through. I don't like everyone's workflow, but if I notice someone making a claim that indicates they might be doing something better than me, then I'm interested.
Maybe keeping your HN profile/gist/repo/webpage up to date would be better.
I don't know about fixing Python types, but fixing TypeScript types can be very time consuming. A LOT of programming work is like this: not solving anything interesting or difficult, just time-consuming drudgery.
These tools have turned out to be great at this stuff. I don’t think I’ve turned over any interesting problems to an LLM and had it go well, but by using them to take care of drudgery, I have a lot more time to think about the interesting problems.
I would suggest that instead of asking people to post their work, try it out on whatever bullshit tasks you’ve been avoiding. And I specifically mean “tasks”. Stuff where the problem has already been solved a thousand times before.
For me comments are for discussions, not essays - from my perspective you went straight into snark about the parent's coding abilities, which kinda kills any hope of a conversation.
I trust it more with Rust than Python tbh, because with Python you need to make sure it runs every code path as the static analysis isn't as good as clippy + rust-analyzer.
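A minimal sketch of what I mean, assuming a stock mypy config with no strict flags (the function here is made up for illustration): anything that flows through `Any` sails past the type checker and only blows up when that exact path runs.

    import json

    def get_name(payload: str) -> str:
        data = json.loads(payload)  # json.loads returns Any, so checking stops here
        # Typo: should be .upper(). A default mypy run stays silent because
        # everything downstream of json.loads is Any; you only find out via an
        # AttributeError when this code path actually executes.
        return data["user"].uppercase()

The equivalent Rust, with typed deserialization, wouldn't even compile, which is the gap I'm getting at.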
I agree; I've had more luck with various models writing Rust than Python, but only when they have tools available so that one way or another they can run `cargo check` and see the nice errors. Otherwise it's pretty equal between the two.
I think the excellent error messages in Rust also help humans as much as they do LLMs, but some of the weaker models get misdirected by some of the "helpful" tips, like an error message suggesting "Why don't you try .clone() here?" when the actual way to address the issue was something else.
That's true, typed languages seem to handle the slop better. One thing I've noticed specifically with Rust, though, is that agents tend to overcomplicate things. They start digging into the gnarlier bits of the language much quicker than they probably need to.
What's your workflow? I've been playing with Claude Code for personal use, usually new projects for experimentation. We have Copilot licenses through work, so I've been playing around with VS Code agent mode for the last week, usually using 3.5 Sonnet, 3.7 Sonnet, or o4-mini. This is in a large Go project. It's been abysmal at everything other than tests. I've been trying to figure out if I'm just using the tooling wrong, but I feel like I've tried all the "best practices" currently: contexts, switching models for planning and coding, rules, better prompting. Nothing's worked so far.
Switch to using Sonnet 4 (it's available in VS Code Insiders, for me at least). I'm not 100% sure, but a GitHub org admin and/or you might need to enable this model in the GitHub web interface.
Write good base instructions for your agent[0][1] and keep them up to date. Have your agent help you write and critique it.
Start tasks by planning with your agent (e.g. "do not write any code."), and have your agent propose 2-3 ways to implement what you want. Jumping straight into something with a big prompt is hit or miss, especially with increased task complexity. Planning also gives your agent a chance to read and understand the context/files/code involved.
Apologies if I'm giving you info you're already aware of.
[0] https://code.visualstudio.com/docs/copilot/copilot-customiza...
[1] Claude Code `/init`
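To make "good base instructions" concrete, here's roughly the shape of mine. The file VS Code picks up is `.github/copilot-instructions.md` (see [0]); every project-specific detail below is just a placeholder to adapt to your own repo.

    # .github/copilot-instructions.md -- contents are illustrative placeholders
    ## Build and test
    - Install with `pip install -e ".[dev]"`.
    - Run `pytest -q` and `mypy src/` before considering a task done.
    ## Conventions
    - Python 3.11, full type hints, no bare `except`.
    - Small focused modules; prefer pure functions over stateful classes.
    ## Known gotchas
    - The legacy `billing/` package predates the current naming scheme;
      read docs/billing.md before touching it.

Whenever the agent makes a wrong assumption, have it propose an addition to the gotchas section; that's how the file stays useful over time.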
This is exactly what I was looking for. Thanks! I'm trying to give these tools a fair shot before I judge them. I've had success with detailed prompts and letting the agent jump straight in when working on small/new projects. I'll give more planning prompts a shot.
Do you change models between planning and implementation? I've seen that recommended but it's been hard to judge if that's made a difference.
I really don't get it. I've tested some agents and they can generate boilerplate. It looks quite impressive if you watch the logs; it actually seems like an autonomous, intelligent agent.
But I can run commands on my local linux box that generate boilerplate in seconds. Why do I need to subscribe to access gpu farms for that? Then the agent gets stuck at some simple bug and goes back and forth saying "yes, I figured out and solved it now" and it keeps changing between two broken states.
The rabid prose, the Fly.io post deriding detractors... To me it seems like the same hype as usual. Lots of words about it, the first few steps look super impressive, then it gets stuck banging against a wall. If almost all that is said is prognostication and preaching, and we haven't seen teams and organizations racing ahead on top of this new engine of growth... maybe it can't actually carry loads outside of the demo track?
It can be useful. Does it merit 100 billion dollar outlays and datacenter-cum-nuclear-powerplant projects? I hardly think so.
What commands/progs on your local Linux box? Would love to be able to quantify how inaccurate the LLMs are compared to what people already use for their boilerplate stuff.
I've found the agents incredibly hit and miss. Mostly miss. The likes of Claude Code occasionally does something surprising and it actually works (usually there's a public example it's copied wholly when you research the code it gave you, especially for niche stuff), but then the rest of the time you spend hours wrestling it into submission over something you could do in minutes, all whilst it haemorrhages context sporadically. Even tried adding an additional vector database to the likes of Claude Code to try and get around this, but it's honestly a waste of time in my experiences.
Is it "useless"? For me, yes, probably. I can't find any valid use for an LLM so far in terms of creating new things. What's already been done before? Sure. But why an LLM in that case?
The strangest thing I've seen so far is Claude Code wanting a plugin that copies values out of a WordPress metadata column so it can read them later, triggered by a watcher every five minutes, instead of just reading the value when relevant. It could not be wrangled into behaving over this and I gave up.
Took me 2 minutes to do the whole thing by hand, and it worked first try (of course—it's PHP—not complicated compared to Verilog and DSP, at which it is spectacularly bad in its output).
It does very odd things in terms of secrets and Cloudflare Workers too.
The solutions it gives are frequently nonsensical or incomplete, mix syntax from various languages (which it sometimes catches itself on before giving you the artifact), and are almost always wildly inefficient, full of pointless steps for what should be a simple task.
Giving Claude Code tutorials, docs, and repos of code is usually a shitshow too. I asked their customer support for a refund weeks ago and have heard nothing. All hype and no substance.
I can see how someone without much dev experience might be impressed by its output, especially if they're only asking it to do incredibly simplistic stuff, for which there's plenty of examples and public discourse on troubleshooting bad code, but once you get into wanting to do new things, I just don't see how anyone could think this is ever going to be viable.
I mucked around with autonomous infrastructure via Claude Code too, and just found that it did absolutely bizarre things that made no sense in terms of managing containers relative to logs, suggesting configurations et al. Better off with dumb scripts with your env vars, secrets et al.
Make sure it writes a requirements and design doc for the change it's going to make, and review those. Also ask it to ask you questions about anywhere there's ambiguity, and to record those responses.
When it has a work plan, track the work plan as a checklist that it fills out as it works.
You can also start your conversations by asking it to summarize the code base.
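Concretely, my kickoff prompt looks something like this (file names and exact wording are just illustrative):

    Before writing any code:
    1. Write REQUIREMENTS.md and DESIGN.md for this change and stop for my review.
    2. Wherever the requirements are ambiguous, ask me questions and record my
       answers in the relevant doc.
    3. Produce WORKPLAN.md as a checklist and keep checking items off as you
       complete them.
    4. Start by summarizing the parts of the code base this change touches.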
My experiments with Copilot and Claude Desktop via MCP on the same codebase suggest that Copilot is trimming the context much more than Desktop. Using the same model, the outputs are just less informed.
> you wouldn't be able to tell from my git history.
I can easily tell from git history which commits were heavily AI generated
> Even when half my day is meetings, you wouldn't be able to tell from my git history.
Your employer, if it is not you, will now expect this level of output.
> Is there more we can add to the AI conventions markdown in the repo to guide the Agent to make fewer mistaken assumptions?
Forgive my ignorance, but is this just a file you're adding to the context of every agent turn, or is this a formal convention in the VS Code Copilot agent? And I'm curious whether there are any resources you used to determine the structure of that document, or if it was just a refinement over time based on mistakes the AI was repeating?
I just finished writing one. It is essentially the onboarding doc for your project.
It is the same stuff you'd tell a new developer on your team: here are the design docs, here are the tools, the code, and this is how you build and test, and here are the parts you might get hung up on.
In hindsight, it is the doc I should have already written.
> "Today is the worst day you will have with this technology for the rest of your life."
Why do we trust corporations to keep making things better all of a sudden?
The most jarring effect of this hype cycle is that all of this trust appears to refer to some imaginary set of corporate entities.