← Back to context

Comment by porise

5 hours ago

I wish the people who wrote this let us know what king of codebases they are working on. They seem mostly useless in a sufficiently large codebase especially when they are messy and interactions aren't always obvious. I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.

This is an antidotal example, but I released this last week after 3 months of work on it as a "nights and weekdends" project: https://apps.apple.com/us/app/skyscraper-for-bluesky/id67541...

I've been working in the mobile space since 2009, though primarily as a designer and then product manager. I work in kinda a hybrid engineering/PM job now, and have never been a particularly strong programmer. I definitely wouldn't have thought I could make something with that polish, let alone in 3 months.

That code base is ~98% Claude code.

  • I don’t know if “antidotal example” is a pun or a typo but I quite like it.

    • That is fun.

      Not sure if it's an American pronunciation thing, but I had to stare at that long and hard to see the problem and even after seeing it couldn't think of how you could possibly spell the correct word otherwise.

      1 reply →

> They seem mostly useless in a sufficiently large codebase especially when they are messy and interactions aren't always obvious.

What type of documents do you have explaining the codebase and its messy interactions, and have you provided that to the LLM?

Also, have you tried giving someone brand new to the team the exact same task and information you gave to the LLM, and how effective were they compared to the LLM?

> I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.

As others have pointed out, from your comment, it doesn't sound like you've used a tool dedicated for AI coding.

(But even if you had, it would still fail if you expect LLMs to do stuff without sufficient context).

1. Write good documentation, architecture, how things work, how to write resources, code styling, etc.

2. Put your more important dependencies source code in the same directory. E.g. put a `_vendor` directory in the project, in it put the codebase at the same tag you're using of whatever: postgres, redis, vue, whatever.

3. Write good plans and requirements. Acceptance criteria, context, user stories, etc. Save them in markdown files. Review those multiple times with LLMs trying to find weaknesses. Then move to implementation files: make it write a detailed plan of what it's gonna change and why, and what it will produce.

4. Write very good prompts. LLMs follow instructions well if they are clear "you should proactively do X", is a weak instruction if you mean "you must do X".

5. LLMs are far from perfect, and full of limits. Karpathy sums their cons very well in his long list. If you don't know their limits you'll mismanage the expectations and not use them when they are a huge boost and waste time on things they don't cope well with. On top of that: all LLMs are different in their "personality", how they adhere to instruction, how creative they are, etc.

For me, in just the golang server instance and the core functional package, `cloc` reports over 40k lines of code, not counting other supporting packages. I spent the last week having Claude rip out the external auth system and replace it with a home-grown one (and having GPT-codex review its changes). If anything, Claude makes it easier on me as a solo founder with a large codebase. Rather than having to re-familiarize myself with code I wrote a year ago, I describe it at a high level, point Claude to a couple of key files, and then tell it to figure out what it needs to do. It can use grep, language server, and other tools to poke around and see what's going on. I then have it write an "epic" in markdown containing all the key files, so that future sessions already know the key files to read.

I really enjoyed the process. As TFA says, you have to keep a close eye on it. But the whole process was a lot less effort, and I ended up doing mor than I would otherwise have done.

Almost always, notes like these are going to be about greenfield projects.

Trying to incorporate it in existing codebases (esp when the end user is a support interaction or more away) is still folly, except for closely reviewed and/or non-business-logic modifications.

That said, it is quite impressive to set up a simple architecture, or just list the filenames, and tell some agents to go crazy to implement what you want the application to do. But once it crosses a certain complexity, I find you need to prompt closer and closer to the weeds to see real results. I imagine a non-technical prompter cannot proceed past a certain prototype fidelity threshold, let alone make meaningful contributions to a mature codebase via LLM without a human engineer to guide and review.

  • I'm using it on a large set of existing codebases full of extremely ugly legacy code, weird build systems, tons of business logic and shipping directly to prod at neckbreaking growth over the last two years, and it's delivering the same type of value that Karpathy writes about.

  • That was true for me, but is no longer.

    It's been especially helpful in explaining and understanding arcane bits of legacy code behavior my users ask about. I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.

  • These models do well changing brownfield applications that have tests because the constraints on a successful implementation are tight. Their solutions can be automatically augmented by research and documentation.

It's important to understand that he's talking about a specific set of models that were release around november/december, and that we've hit a kind of inflection point in model capabilities. Specifically Anthropic's Opus 4.5 model.

I never paid any attention to different models, because they all felt roughly equal to me. But Opus 4.5 is really and truly different. It's not a qualitative difference, it's more like it just finally hit that quantitative edge that allows me to lean much more heavily on it for routine work.

I highly suggest trying it out, alongside a well-built coding agent like the one offered by Claude Code, Cursor, or OpenCode. I'm using it on a fairly complex monorepo and my impressions are much the same as Karpathy's.

Claude and Codex are CLI tools you use to give the LLM context about the project on your local machine or dev environment. The fact that you're using the name "ChatGPT" instead of Codex leads me to believe you're talking about using the web-based ChatGPT interface to work on a large codebase, which is completely beside the point of the entire discussion. That's not the tool anyone is talking about here.

The code base I work on at $dayjob$ is legacy, has few files with 20k lines each and a few more with around 10k lines each. It's hard to find things and connect dots in the code base. Dont think LLMs able to navigate and understand code bases of that size yet. But have seen lots of seemingly large projects shown here lately that involve thousands of files and millions of lines of code.

  • I’ve found that LLMs seem to work better on LLM-generated codebases.

    Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.

    As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.

    We call this tech debt.

    Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.

    With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).

    Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.

    YMMV here, as it’s hard to do large refactors without tests for correctness.

    (Note: I’ve dabbled with a test first refactor approach, but haven’t gone to the lengths to suggest it works, but I believe it could)

I don't know how big sufficiently large codebase is, but we have a 1mil loc Java application, that is ~10years old, and runs POS systems, and Claude Code has no issues with it. We have done full analyses with output details each module, and also used it to pinpoint specific issues when described. Vibe coding is not used here, just analysis.

What do you even mean by "ChatGPT"? Copy pasting code into chatgpt.com?

AI assisted coding has never been like that, which would be atrocious. The typical workflow was using Cursor with some model of your choice (almost always an Anthropic model like sonnet before opus 4.5 released). Nowadays (in addition to IDEs) it's often a CLI tool like Claude Code with Opus or Codex CLI with GPT Codex 5.2 high/xhigh.

If you have a ChatGPT account, there's nothing stopping you from installing codex cli and using your chatgpt account with it. I haven't coded with ChatGPT for weeks. Maybe a month ago I got utility out of coding with codex and then having ChatGPT look at my open IDE page to give comments, but since 5.2 came out, it's been 100% codex.

I've been trying Claude on my large code base today. When I give it the requirements I'd give an engineer and so "do it" it just writes garbage that doesn't make sense and doesn't seem to even meet the requirements (if it does I can't follow how - though I'll admit to giving up before I understood what it did, and I didn't try it on a real system). When I forced it to step back and do tiny steps - in TDD write one test of the full feature - it did much better - but then I spent the next 5 hours adjusting the code it wrote to meet our coding standards. At least I understand the code, but I'm not sure it is any faster (but it is a lot easier to see things wrong than come up with green field code).

Which is to say you have to learn to use the tools. I've only just started, and cannot claim to be an expert. I'll keep using them - in part because everyone is demanding I do - but to use them you clearly need to know how to do it yourself.

  • Have you tried showing it a copy of your coding standards?

    I also find pointing it to an existing folder full of code that conforms to certain standards can work really well.

  • I've been playing around with the "Superpowers" [0] plugin in Claude Code on a new small project and really like it. Simple enough to understand quickly by reading the GitHub repo and seems to improve the output quality of my projects.

    There's basically a "brainstorm" /slash command that you go back and forth with, and it places what you came up with in docs/plans/YYYY-MM-DD-<topic>-design.md.

    Then you can run a "write-plan" /slash command on the docs/plans/YYYY-MM-DD-<topic>-design.md file, and it'll give you a docs/plans/YYYY-MM-DD-<topic>-implementation.md file that you can then feed to the "execute-plan" /slash command, where it breaks everything down into batches, tasks, etc, and actually implements everything (so three /slash commands total.)

    There's also "GET SHIT DONE" (GSD) [1] that I want to look at, but at first glance it seems to be a bit more involved than Superpowers with more commands. Maybe it'd be better for larger projects.

    [0] https://github.com/obra/superpowers

    [1] https://github.com/glittercowboy/get-shit-done

Try Claude code. It’s different.

After you tried it, come back.

  • I think its not Claude code per se itself but rather the (Opus 4.5 model?) or something in an agentic workflow.

    I tried a website which offered the Opus model in their agentic workflow & I felt something different too I guess.

    Currently trying out Kimi code (using their recent kimi 2.5) for the first time buying any AI product because got it for like 1.49$ per month. It does feel a bit less powerful than claude code but I feel like monetarily its worth it.

    Y'know you have to like bargain with an AI model to reduce its pricing which I just felt really curious about. The psychology behind it feels fascinating because I think even as a frugal person, I already felt invested enough in the model and that became my sunk cost fallacy

    Shame for me personally because they use it as a hook to get people using their tool and then charge next month 19$ (I mean really Cheaper than claude code for the most part but still comparative to 1.49$)

They build Claude Code fully with Claude Code.

  • Which is equal parts praise and damnation. Claude Code does do a lot of nice things that people just kind of don't bother for time cost / reward when writing TUIs that they've probably only done because they're using AI heavily, but equally it has a lot of underbaked edges (like accidentally shadowing the user's shell configuration when it tries to install terminal bindings for shift-enter even though the terminal it's configuring already sends a distinct shift-enter result), and bugs (have you ever noticed it just stop, unfinished?).

  • Ah, now I understand why @autocomplete suddenly got broken between versions and still not fixed )

I'm afraid that we're entering a time when the performance difference between the really cutting edge and even the three-month-old tools is vast

If you're using plain vanilla chatgpt, you're woefully, woefully out of touch. Heck, even plain claude code is now outdated

  • Why is plain Claude code outdated? I thought that’s what most people are using right now that are AI forward. Is it Ralph loops now that’s the new thing?

    • Plain Claude Code doesn’t have enough scaffolding to handle large projects

      At a base level, people are “upgrading” their Claude Code with custom skills and subagents - all text files saved in .claude/agents|skills.

      You can also use their new tasks primitive to basically run a Ralph-like loop

      But at the edges, people are using multiple instances, each handling different aspects in parallel - stuff like Gas Town

      Tbf you can still get a lot of mileage out of vanilla Claude Code. But I’ve found that even adding a simple frontend design skill improves the output substantially