
Comment by ModernMech

2 months ago

That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but it will also require updating the code to new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!

Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out-of-the-ordinary problems, like all the weird edge cases and poor documentation that come up when trying to upgrade old software.

  • Actually, I think the opposite: Upgrading a project that needs dependency updates to new major versions—let’s say Zod 4, or Tailwind 3—requires reading the upgrade guides and documentation, and transferring that into the project. In other words, transforming text. It’s thankless, stupid toil. I’m very confident I will not be doing this much more often in my career.

    • Absolutely, this should be exactly the kind of task a bot should be perfect for. There's no abstraction, no design work, no refactoring, no consideration of stakeholders, just finding instances of whatever is old and busted and changing it for the new hotness.

      3 replies →

    • Theoretically we don't even need AI. If semantics were defined well enough, and maintainers actually cared about and properly tracked breaking changes, we could have tools that automatically upgrade our code. Just a bunch of simple scripts that perform text transformations (a toy sketch of such a script appears below this thread).

      The problem is purely social. There are language ecosystems where great care is taken not to break stuff, and where you can let your project rot for a decade or two, come back to it, and it will compile perfectly with the newest release. And then there is the JS world, where people introduce churn just for the sake of their ego.

      Maintaining a project is orders of magnitude more complex than creating a new greenfield project. It takes a lot of discipline. There is just a lot, a lot of context to keep in mind, which really challenges even the human brain. That is why we see so many useless rewrites of existing software. It is easier, more exciting, and most importantly something to brag about on your CV.

      AI will only cause more churn, because it makes churn easier to create. Ultimately that leaves humans with more maintenance work and less fun time.

      2 replies →

    • That assumes accurate documentation, upgrade guides that cover every edge case, and the miracle of package updates not causing a cascade of unforeseen compatibility issues.

      1 reply →

    • Except that for breaking changes you frequently need to know why it was done the old way in order to know what behavior it should have after the update.

  • That's the easiest task for an LLM to do. Upgrading from version x.y to z.y is for the most part syntax changes. The issue is that most of the documentation sucks. The LLM issue is that it doesn't have access to that documentation in the first place. Coding LLMs should interact with LSPs like humans do: you ask the LSP for all possible functions, you read the function docs, and then you type from the available list of options (a rough sketch of such a request follows below this thread).

    LLMs can in theory do that but everyone is busy burning GPUs.
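
To make the LSP idea in the reply above concrete, here is a minimal sketch (an illustration, not anything the commenters actually built) of the JSON-RPC messages a coding agent could send to a language server over stdio: initialize, open the file being edited, then ask for completions at the cursor instead of guessing method names from training data. The server command, file path, and code snippet are assumed for the example.

    // Sketch: the LSP requests an agent harness could send instead of hallucinating APIs.
    // Assumes a language server is available over stdio, e.g. `typescript-language-server --stdio`.

    type JsonRpcMessage = {
      jsonrpc: "2.0";
      id?: number;       // present for requests, absent for notifications
      method: string;
      params?: unknown;
    };

    // LSP framing: a Content-Length header, a blank line, then the JSON body.
    function frame(msg: JsonRpcMessage): string {
      const body = JSON.stringify(msg);
      return `Content-Length: ${Buffer.byteLength(body, "utf8")}\r\n\r\n${body}`;
    }

    // 1. Handshake: `initialize` must be the first request.
    const initialize = frame({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: { processId: process.pid, rootUri: "file:///tmp/demo-project", capabilities: {} },
    });

    // 2. Tell the server about the file the agent is editing (hypothetical content).
    const didOpen = frame({
      jsonrpc: "2.0",
      method: "textDocument/didOpen",
      params: {
        textDocument: {
          uri: "file:///tmp/demo-project/src/index.ts",
          languageId: "typescript",
          version: 1,
          text: 'import { z } from "zod";\nconst schema = z.',
        },
      },
    });

    // 3. Ask what is actually callable right after "z." (line 1, character 17).
    //    The response lists real members of the installed package version, with docs,
    //    so the agent picks from what exists instead of inventing a method.
    const completion = frame({
      jsonrpc: "2.0",
      id: 2,
      method: "textDocument/completion",
      params: {
        textDocument: { uri: "file:///tmp/demo-project/src/index.ts" },
        position: { line: 1, character: 17 },
      },
    });

    for (const msg of [initialize, didOpen, completion]) console.log(msg + "\n");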
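
And for the "just a bunch of simple scripts that perform text transformations" point from earlier in the thread, a throwaway codemod in that spirit could be as small as the following. The breaking change it applies (renaming a `url` option to `baseUrl` in a `createClient` call) is made up for illustration; a real upgrade guide would supply the actual change, and a sturdier tool would rewrite the syntax tree rather than use a regex, so it never touches strings or comments.

    // Sketch of a one-off migration script: apply one documented breaking change
    // mechanically across a codebase, then review the diff by hand.
    import { readdirSync, readFileSync, statSync, writeFileSync } from "node:fs";
    import { join } from "node:path";

    // Recursively collect .ts files under a directory, skipping node_modules.
    function sourceFiles(dir: string): string[] {
      const out: string[] = [];
      for (const entry of readdirSync(dir)) {
        const full = join(dir, entry);
        if (statSync(full).isDirectory()) {
          if (entry !== "node_modules") out.push(...sourceFiles(full));
        } else if (full.endsWith(".ts")) {
          out.push(full);
        }
      }
      return out;
    }

    // The mechanical transformation, straight from the (hypothetical) upgrade guide:
    // createClient({ url: ... })  ->  createClient({ baseUrl: ... })
    function migrate(source: string): string {
      return source.replace(/createClient\(\{\s*url:/g, "createClient({ baseUrl:");
    }

    for (const file of sourceFiles("src")) {
      const before = readFileSync(file, "utf8");
      const after = migrate(before);
      if (after !== before) {
        writeFileSync(file, after);
        console.log(`migrated ${file}`);
      }
    }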

And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.

For example, a lot of LLMs (I've seen it in Gemini 2.5 and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.

  • "... and breaking out of an agentic workflow to deep dive the problem is quite frustrating"

    Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

    Or better yet, the bot is able to recognize its own limitations and proactively surface these instances, be like hey human I'm not sure what to do in this case; based on the docs I think it should be A or B, but I also feel like C should be possible yet I can't get any of them to work, what do you think?

    As humans, it's perfectly normal to put up a WIP PR and then solicit this type of feedback from our colleagues; why would a bot be any different?

    • > Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

      Still, the big short-term danger is that you're left with code that seems to work well but has subtle bugs in it, and the long-term danger is that you're left with a codebase you're not familiar with.

      1 reply →

  • The agents will definitely need a way to evaluate their work just as well as a human would - whether that's a full test suite, tests + directions on some manual verification as well, or whatever. If they can't use the same tools as a human would they'll never be able to improve things safely.

  • > if mostly because agentic coding entices us into being so lazy.

    Any coding I've done with Claude has been asking it to build specific methods; if you don't understand what's actually happening, then you're building something that's unmaintainable. I feel like it reduces typing and syntax errors, but sometimes it leads me down a wrong path.

    • I can just imagine it now: you launch your first AI-coded product and get a bug in production, and the only way the AI can fix the bug is to rewrite and deploy the app with a different library. You then proceed to show the changelog to the CCB for approval, including explaining the fix to the client and trying to explain its risk profile for their sign-off.

      "Yeh, we solved the duplicate name appearing the table issue by moving databases engines and UI frameworks to ones more suited to the task"

I think this type of thing needs an agent that has access to the documentation, so it can read about the nuances of the language and package versions, and definitely a way to investigate types and interfaces. The problem is that the training data mixes so many versions that it can easily confuse the AI into mixing up versions, APIs, etc.