Comment by saagarjha

3 days ago

I think the biggest difference here is that your code generator is probably deterministic, and you can likely debug the results it produces rather than treating it like a black box.

Overloading the term "generate" is probably creating some confusion here. In how it transforms input into output, an LLM/agent is a lot closer to a human than to a compiler or code generator.

I've recently been working on a project with heavy use of AI (probably around 100 hours of long-running autonomous AI sprints over the last few weeks), and if you tried to re-run all of my prompts in order, even using the exact same models with the exact same tooling, it would almost certainly fall apart quickly. After the first few, a huge portion of the remaining prompts would reference code that wouldn't exist and/or respond to things the AI never actually said. Meta-prompting (prompting agents to prepare prompts for other agents) would be an interesting challenge to encode properly. And how would human code changes be represented? As patches against code that also wouldn't exist?

The whole idea also ignores that AI being fast and cheap compared to human developers doesn't make it infinitely fast or free, or put it anywhere near a compiler in speed or cost. Even if this were conceptually feasible, all it would really accomplish is that every new release of a major software project would take weeks (or more) of build time and burn thousands of dollars (or more) of compute.

It's an interesting thought experiment, but the way I would put it into practice is with tooling that includes all relevant prompts / chat logs in each commit message. Then maybe in the future an agent running a more advanced model could walk the history commit by commit, take notes on how each change could have been implemented better based on the associated commit message and any source prompts it contains, consolidate those notes into a set of recommended changes to the current code, and then actually apply those recommendations in a series of pull requests.
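For concreteness, here's a rough sketch of what that history walk might look like. It assumes a made-up convention where the tooling records each source prompt as a "Prompt:" line in the commit message; that trailer name and the note-taking step are placeholders for whatever an actual agent harness would do, not any existing tool.

    # Rough sketch: walk git history oldest-first, pulling out the
    # hypothetical "Prompt:" lines recorded in each commit message.
    import subprocess

    def commits_oldest_first():
        # %H = commit hash, %B = raw body; NUL and 0x01 as field/record separators.
        out = subprocess.run(
            ["git", "log", "--reverse", "--format=%H%x00%B%x01"],
            capture_output=True, text=True, check=True,
        ).stdout
        for record in out.split("\x01"):
            record = record.strip()
            if not record:
                continue
            sha, _, body = record.partition("\x00")
            yield sha, body

    def prompts_in(message):
        # Assumed convention: source prompts stored as "Prompt:" lines.
        return [line[len("Prompt:"):].strip()
                for line in message.splitlines()
                if line.startswith("Prompt:")]

    notes = []
    for sha, message in commits_oldest_first():
        for prompt in prompts_in(message):
            # A real agent would look at the diff for `sha`, the prompt, and the
            # current code, and record how the change could have been done better;
            # here we just collect the raw material.
            notes.append((sha, prompt))

    # `notes` would then seed the consolidated set of recommended changes / PRs.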

People keep saying this and it doesn't make sense. I review code. I don't construct a theory of mind of the author of the code. With AI-generated code, if it isn't eminently reviewable, I reflexively kill the PR and either try again or change the tasking.

There's always this vibe that, like, AI code is like an IOCCC puzzle. No. It's extremely boring mid-code. Any competent developer can review it.

  • I assumed they were describing AI itself as a black box (contrasting it with deterministic code generation), not the output of AI.

    • Right, I get that, and an LLM call by itself clearly is a black box. I just don't get why that's supposed to matter. It produces an artifact I can (and must) verify myself.


  • You construct a theory of mind of the author of a work whether you recognize you are doing it or not. There are certain things everyone assumes about code based on the fact that we expect someone who writes code to have simple common sense. Which, of course, LLMs do not.

    When you're talking to a person and interpreting what they mean, you have an inherent theory of mind whether or not you're consciously thinking "how does this person think?" It's how we communicate efficiently with other people, and it's one of the many things missing from LLM roulette. It's not that you generate a new "theory of mind" with every interaction, and it's not something you have to do consciously (although you can, like breathing).