Comment by jasonthorsness
2 months ago
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”
Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.
It could be! But that's also what people said about all the models before it!
And they might all be right!
> This tech could lead to...
I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.
Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.
A good question would be when can AI take a basic prompt, gather its own requirements and build meaningful PRs off basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.
Did you not see the live stream? They took a feature request for excalidraw (table support) and Claude 4 worked on it for 90 minutes and the PR was working as expected. I’m not sure if they were using sonnet or opus.
4 replies →
I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!
[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).
> I am incredibly eager to see what affordable coding agents can do for open source :)
Oh, we know exactly what they will do: they will drive devs insane: https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my...
I dunno, looking through those issues I'd be more annoyed by all the randos grandstanding in my PRs.
1 reply →
Especially since the EU just made open source contributors liable for cybersecurity (Cyber Resilience Act). Just let AI contribute and ur good
Didn’t they make an exception for open-source projects? https://opensource.org/blog/the-european-regulators-listened...
8 replies →
That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but also it will require updating the code to some new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!
Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out of the ordinary problems like all the weird edge cases and poor documentation that comes up when trying to upgrade old software.
Actually, I think the opposite: Upgrading a project that needs dependency updates to new major versions—let’s say Zod 4, or Tailwind 3—requires reading the upgrade guides and documentation, and transferring that into the project. In other words, transforming text. It’s thankless, stupid toil. I’m very confident I will not be doing this much more often in my career.
10 replies →
That's the easiest task for an LLM to do. Upgrading from x.y to z.y is for the most part syntax changes. The issue is that most of the documentation sucks. The LLM issue is that it doesn't have access to that documentation in the first place. Coding LLMs should interact with LSPs like humans do. You ask the LSP for all possible functions, you read the function docs and then you type from the available list of options.
LLMs can in theory do that but everyone is busy burning GPUs.
Google demoed an automated version upgrade for Android libraries during I/O 2025. The agent does multiple rounds and checks error messages during each build until all dependencies work together.
Agentic Experiences: Version Upgrade Agent
https://youtu.be/ubyPjBesW-8?si=VX0MhDoQ19Sc3oe-
1 reply →
And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.
For example a lot of llms (I've seen it in Gemini 2.5, and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.
"... and breaking out of an agentic workflow to deep dive the problem is quite frustrating"
Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".
Or better yet, the bot is able to recognize its own limitations and proactively surface these instances, be like hey human I'm not sure what to do in this case; based on the docs I think it should be A or B, but I also feel like C should be possible yet I can't get any of them to work, what do you think?
As humans, it's perfectly normal to put up a WIP PR and then solicit this type of feedback from our colleagues; why would a bot be any different?
2 replies →
The agents will definitely need a way to evaluate their work just as well as a human would - whether that's a full test suite, tests + directions on some manual verification as well, or whatever. If they can't use the same tools as a human would they'll never be able to improve things safely.
> if mostly because agentic coding entices us into being so lazy.
Any coding I've done with Claude has been to ask it to build specific methods, if you don't understand what's actually happening, then you're building something that's unmaintainable. I feel like it's reducing typing and syntax errors, sometime it leads me down a wrong path.
1 reply →
I think this type of thing needs agent which has access to the documentation to read about nuances of the language and package versions, definitely a way to investigate types, interfaces. Problem is that training data has so much mixed data it can easily confuse the AI to mix up versions, APIs etc.
> having package upgrades and other mostly-mechanical stuff handled automatically
Those are already non-issues mostly solved by bots.
In any case, where I think AI could help here would be by summarizing changes, conflicts, impact on codebase and possibly also conduct security scans.
Anyone see news of when it’s planned to go live in copilot?
The option just shown up in Copilot settings page for me
Turns out Opus 4 starts at their $40/mo ("Pro+") plan which is sad, and they serve o4-mini and Gemini as well so it's a bit less exclusive than this announcement implies. That said, I have a random question for any Anthropic-heads out there:
GitHub says "Claude Opus 4 is hosted by Anthropic PBC. Claude Sonnet 4 is hosted by Anthropic 1P."[1]. What's Anthropic 1P? Based on the only Kagi result being a deployment tutorial[2] and the fact that GitHub negotiated a "zero retention agreement" with the PBC but not whatever "1P" is, I'm assuming it's a spinoff cloud company that only serves Claude...? No mention on the Wikipedia or any business docs I could find, either.
Anyway, off to see if I can access it from inside SublimeText via LSP!
[1] https://docs.github.com/en/copilot/using-github-copilot/ai-m...
[2] https://github.com/anthropics/prompt-eng-interactive-tutoria...
3 replies →
Same! Rock and roll!
1 reply →
The keynote confirms it is available now.
Gotta love keynotes with concurrent immediate availability
3 replies →
I don't see how a LLM could do better than a bot, eg renovate
Until it pushes a severe vulnerability which takes a big service doen