Comment by trjordan

15 hours ago

> Those are not code problems. They are evaluation problems.

> Code becomes precious when it is the only place knowledge lives.

Reading AI code all day is _agonizing_. Just, a horrible way to live, and it melts people's brains at the moment you need them to be the most capable.

Manual programming has this really productive and gratifying feedback loop, where you read the code, write the code, and fix it until it compiles/runs/does what you want. AI code not only does half that for you, but it makes the "click" at the end uninspiring because you're never sure if it's cheated a bit to get to that moment.

Trying to operate with AI-generated code as the only durable artifact of programming is a dead end for the industry. Charity points to (and correctly discards) architecture diagrams/specs as an interesting space to work in. My suspicion is that it's closer to the thing that's hand-written: prompts, markdown plans, and other nudges. Focus on the thing that you, as a human, produce: that's the basis for the core loop of "did the AI follow my instructions," and it's higher-leverage when you go to code review.

By the time you get to the PR, you've probably typed enough to Claude that you can regenerate the code, but the current industry default is to just throw away all those sessions and ship the code. That's backwards!

If a coworker dumped a 5k-line code review on you, you'd tell them to come back when it's broken down into smaller, reviewable chunks. Large dumps of code are basically unreviewable by humans, but it seems like a lot of people have forgotten about that when it comes to LLMs.

  • I think it's worse than that. At least if I dumped 5k LoC on somebody in 2021, you knew I spent the time to write it, so it's "fair" to ask you to read it. But I didn't write it in 2026, so you shouldn't read it.

    I think it's less about "break it down" and more about "let's communicate at the same altitude."

    I wrote a (bait-titled) post about it: https://tern.sh/blog/stop-reading-prs/

    • 113 files +22913 −2423

      305 files +15075 −13110

      153 files +21934 −8698

      125 files +28120 −2398

      43 files +11188 −63

      118 files +21564 −647

      These are the largest (6 of 35) in the past 30 days. added: 190079 removed: 39696 in the last 6 months

      from one person.

  • You aren't allowed to block PRs for being too large anymore. The objective that every engineer should be 2x/3x/5x more productive can only be achieved if you go totally lax on code reviews.

    Because if all your SWEs produce 5x more code, it also means they have to review 5x more code. But LLMs don't really help with code reviews. Then it becomes a Metcalfian paradox unless you just rubberstamp PRs, which is what is expected of you.

    • it's pretty easy to point your terminal agent to your giant pr and ask it to break it up into small prs

      if you're being asked to rubberstamp prs that's a management skill issue

  • Breaking up a giant PR can be a tedious, time-consuming hassle, and in the past I could sympathize in practice if someone had a giant PR they didn't have time to decompose once they got it working.

    But it's also the exact sort of thing that LLMs are literally perfect for in my experience so there's really no excuse anymore. I've never seen Claude fail to turn a 5k PR into a well-decomposed Graphite stack.

    • Hell, I've hand-written a large PR as a single commit and then asked Claude to break it down for me at least once. But I think the fact that doing this task by hand is a tedious, time-consuming hassle is not because it inherently has to be but because the tooling for doing it has barely changed in the past 15 years.

  • Now you get not just the 5 LoC to review but a 5 page essay to read in the form of an auto-generated review as well. Which makes the submitter even more indignant when you start nitpicking things about how it's implemented.

  • It is not so much forgetting as much as it is acceptance that when welcoming AI into a codebase, the code can no longer matter; that all that matters is that the properties of the system are validated. That isn't a change that comes free, so nobody should be expecting magic, it is a different set of tradeoffs. There is no such thing as a panacea.

    • > all that matters is that the properties of the system are validated

      I don’t think this is possible in practice without leaning on the stability of the code base.

  • I think they expect you to also use an LLM to review, and I bet they are doing exactly that when asked to review someone else's code.

    • That gets you 90% of the way there. So it only really works if you accept the cruft and the risks associated with that last 10%. Been doing this day in and day out for the last few months, and no matter how many automated reviews we run or how good we get them, we still can't skip the manual ones.

  • > If a coworker dumped a 5k-line code review on you, you'd tell them to come back when it's broken down into smaller, reviewable chunks.

    I would, and all my training at Google told me to do that. But what I found after I left that comfortable box was that somehow this kind of practice is acceptable in the industry at large and you're expected to just Deal With It(tm). 5k lines isn't even high by what I've seen.

    Worse, the "code review" tools that people have access to in GitHub make it absolutely and totally unworkable to incrementally improve a review. Messy merge commits full of "responding to code review" comments. Threads impossible to follow. Just bad tooling.

    So a lot of shops, from what I've seen, are just yeeting it with very shallow reviews.

    This is my observation pre agentic AI. LLMs just threw kerosene on that dumpster fire.
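For a sense of scale, the six diff stats quoted earlier in this thread can be totalled and checked against a size limit with a few lines of Python. This is only a sketch: the 5,000-line limit is borrowed from the "5k-line" example above rather than from any real policy, and the function names are made up for illustration.

```python
import re

# Diff stats as quoted in the thread: "<files> files +<added> −<removed>".
QUOTED_STATS = """
113 files +22913 −2423
305 files +15075 −13110
153 files +21934 −8698
125 files +28120 −2398
43 files +11188 −63
118 files +21564 −647
"""

# Accept both the Unicode minus sign (U+2212) used in the thread and ASCII "-".
STAT_RE = re.compile(r"(\d+) files \+(\d+) [\u2212-](\d+)")

REVIEWABLE_LIMIT = 5_000  # assumption: the "5k-line" figure from the thread


def parse_stats(text):
    """Return a list of (files, added, removed) tuples found in text."""
    return [tuple(int(g) for g in m.groups()) for m in STAT_RE.finditer(text)]


def changed_lines(stat):
    """Total lines touched by one PR: added plus removed."""
    _files, added, removed = stat
    return added + removed


def oversized(stats, limit=REVIEWABLE_LIMIT):
    """PRs whose changed-line count exceeds the limit."""
    return [s for s in stats if changed_lines(s) > limit]


stats = parse_stats(QUOTED_STATS)
total_added = sum(s[1] for s in stats)
total_removed = sum(s[2] for s in stats)
print(f"{len(stats)} PRs, +{total_added} -{total_removed}")
print(f"{len(oversized(stats))} of {len(stats)} exceed {REVIEWABLE_LIMIT} changed lines")
```

A check like this only measures size; it says nothing about whether a smaller PR is actually easier to review.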

"Reading AI code all day is _agonizing_. Just, a horrible way to live, and it melts people's brains at the moment you need them to be the most capable."

I think it's very similar to dealing with large offshore teams. Every day you get a huge pile of code to review. It's really exhausting.

I prefer dealing with AI because at least it tends to follow rules once I write them down. Not so much with a lot of offshore guys. Same mistakes every day.

I guess my company needs to hire better offshore devs....

I agree that reading AI code all day is agonizing. We're relying on code review to develop parts of our mental model of the system that were previously developed through coding. We're having more difficulty comprehending and recalling details of the system. This is probably unsurprising; people recall information they "generated" better than information they read. I am applying some lessons from pedagogy to extend code review. If this resonates with you, I would like to talk.

Are there any products out there that are capturing the prompts/sessions? I imagine you could do it in an ad hoc way, asking Claude to write up a summary of the session as part of the commit message. But is there anything else that's more structured/higher level?

  • I am working on solving the AI Code Provenance problem and I believe my repos may be the first that provide AI code provenance. See the following example:

    https://github.com/gitsense/gsc-cli/blob/main/internal/cli/r...

    Notice how the code block header attributes the model. The UUID can be traced to the conversation so everybody can tell exactly how the code came about. For this to work though, you need to use my chat app as it ensures you can't tamper with things if you are truly serious about AI code provenance.

    I also have a much more human-focused method, which is part of my CLI tool.

    https://github.com/gitsense/gsc-cli

    I am currently looking at making pi (https://github.com/earendil-works/pi) support AI code provenance, but for now if you want a more structured way to capture what you have done in an agent session that can be used in code reviews and be carried forward as knowledge that lives inside your repository, I have

    gsc lessons

    The basic idea is, after you have finished chatting/working with the agent, you would work with it to identify lessons worth carrying forward. You can store your session if you want, but really, the lessons should be something that can help you review code better and to prevent future mistakes.

    I have a real working example at

    https://github.com/gitsense/smart-ripgrep

    This is a fork of the BurntSushi/ripgrep repository. It shows how you can use lessons to learn from past design decisions.

  • We just have a hook that runs on git push that instructs Claude to ensure the PR description is up to date. Works well enough for us.
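One ad hoc way to keep a session summary next to the code, as asked about above, is to append it to the commit message as git-style trailers ("Key: value" lines in the last paragraph of the message). The sketch below is not any particular product's format: the trailer keys `AI-Session` and `AI-Summary`, the function name, and the example session ID are all invented for illustration, and it assumes the message has no trailers yet.

```python
def with_session_trailers(message, session_id, summary):
    """Append session metadata to a commit message as git-style trailers.

    The trailer keys (AI-Session, AI-Summary) are hypothetical, chosen only
    for this sketch. Assumes the message does not already end in trailers.
    """
    # Collapse the summary to one line so it stays a single trailer value.
    one_line = " ".join(summary.split())
    body = message.rstrip("\n")
    # A blank line separates the trailer paragraph from the rest of the message.
    return f"{body}\n\nAI-Session: {session_id}\nAI-Summary: {one_line}\n"


msg = with_session_trailers(
    "Fix retry backoff",
    "example-session-id",  # placeholder, not a real session identifier
    "Asked the agent to\ncap retries at 5.",
)
print(msg)
```

Because the summary travels with the commit, a reviewer can compare what was asked for with what the diff does without needing access to the original session. Git also ships a `git interpret-trailers` command for adding and parsing trailers, which may be a better fit than string formatting in a real hook.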

It almost seems like the juice might not be worth the squeeze. If you want verifiable code that conforms well to a well-designed plan, you have to basically write pseudocode and have the AI translate it for you. At that point why use the AI to write the code at all? And then, personally, I find that I just have more fun planning, writing, and debugging myself. I think it's kinda the part of programming that I fell in love with in the first place.

  • This is the core of the insanity nobody who vibe codes seems to be able to grasp.

    It's not even that it's more fun; AI can spew endless slop boilerplate but it simply can't handle boiling an application down to its component essence in a way that makes it straightforward to maintain and bug free.

Flintstone Engineering is applying Space Age synthetic intelligence (in a metaphorical sense) technology with code generation. Babysitting, version controlling, etc. generated code should be a thing of the past. But that is what GenAI is.

At the very least apply it at a higher level: specification, proofs, anything but generating Rust/Java/C and then letting yourself or an agent babysit it.

the act, eval, adjust loop is probably neurologically important.. reading about things you didn't dive into is really a dread

depending on your industry, you might be able to ship half-slop and then fix some bugs downstream though