Comment by iteria

3 days ago

You don't even need such fancy examples. There are plenty of codebases where people are working with code that is over a decade old and has several paradigms all intermixed with a lot of tribal knowledge that isn't documented in code or wiki. That is where AI sucks. It will not be able to make meaningful changes in that environment.

There is also the frontend, and those codebases don't need to be very old at all before AI falls down. Between NPM packages and clashing styles in a codebase, AI has not been very helpful to me at all.

Generally speaking, while AI is a fine enhancement to autocomplete, I haven't seen it be able to do anything more serious in a mature codebase. The moment business rules and tech debt sneak in in any capacity, AI becomes so unreliable that it's faster to just write it yourself. If I can't trust the AI to automatically generate a list of exports in an index.ts file, what can I trust it for?

When is the last time you tried using LLMs against a large, old, crufty undocumented codebase?

Things have changed a lot in the past six weeks.

Gemini 2.5 Pro accepts a million tokens and can "reason" with them, which means you can feed it hundreds of thousands of lines of code and it has a surprisingly good chance of figuring things out.

OpenAI released their first million-token models with the GPT-4.1 series.

OpenAI o3 and o4-mini are both very strong reasoning code models with 200,000-token input limits.

These models are all new within the last six weeks. They're very, very good at working with large amounts of crufty undocumented code.
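For a rough sense of whether a given codebase fits in those windows, here is a minimal sketch. The limits are the figures quoted above; the 4-characters-per-token ratio, the 20% headroom for the prompt and reply, the file suffixes, and the dictionary keys (plain labels, not API model identifiers) are all assumptions for illustration. A real tokenizer will give different counts, especially for code.

```python
import math
from pathlib import Path

# Input limits as quoted in this comment. Keys are labels, not API model IDs.
CONTEXT_LIMITS = {
    "gemini-2.5-pro": 1_000_000,
    "gpt-4.1": 1_000_000,
    "o3": 200_000,
    "o4-mini": 200_000,
}

# Assumption: roughly 4 characters per token. Real tokenizers vary.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Very rough token estimate based on character count alone."""
    return math.ceil(len(text) / chars_per_token)


def estimate_repo_tokens(root, suffixes=(".py", ".ts", ".js")):
    """Sum the rough estimates for every source file under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += estimate_tokens(path.read_text(errors="ignore"))
    return total


def models_that_fit(token_count, reserve=0.2):
    """Return labels whose limit covers token_count with headroom left over.

    reserve is the fraction of the window held back for instructions and output.
    """
    return [
        name
        for name, limit in CONTEXT_LIMITS.items()
        if token_count <= limit * (1 - reserve)
    ]
```

Calling `models_that_fit(estimate_repo_tokens("path/to/repo"))` gives a first guess at which models could take the whole thing in one prompt. If exact numbers matter, use the provider's own token counter instead of this heuristic.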

  • Ultimately LLMs don’t really understand what the code does at runtime. Sure, just parsing out the codebase can help make a good guess but in some cases it’s hard to trust LLMs with changes because the consequences are unknown in complex codebases that have weird warts nobody documented.

    Maybe in a generation or two codebases will become more uniform and predictable if fewer humans do it by hand. Same with self-driving cars: if there were no human drivers out there, the problem would become trivial to conquer.

    • That's a lot less true today than it was six weeks ago. The "reasoning" models are spookily good at answering questions about how code runs, and identifying the source of bugs.

      They still make mistakes, and yeah they're still (mostly) next token predicting machines under the hood, but if your mental model is "they can't actually predict how some code will execute" you may need to update that.

  • Gemini 2.5 Pro crashes with a 500 status code every 5 requests. Not great for a model you're supposed to rely on.

    • Yeah, there's a reason it still has "preview" and "experimental" in the model names.