Comment by KronisLV
19 days ago
> Code must not be written by humans
This might be okay short term, e.g. when you just want to get something done. Probably not on the scale of decades, though, or we will end up completely unable to write code ourselves.
> Code must not be reviewed by humans
This is where it all goes to crap. Until the day AI agents can look at some output and say "no, this is overengineering, this can be done far more simply, let's stick to the established patterns within the codebase", and do so consistently, a lack of oversight will compound failures. I don't mean review between every single small change, but at least enough to catch failures before anything gets merged.
This could only be avoided if you could define a harness with thousands or tens of thousands of tests per codebase, encapsulating EVERYTHING that must and must not be done within it, down to how gaps, colors, and utility classes are used, which I don't see most people doing.
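As a rough sketch of what one such convention test might look like (the class names and the allowlist here are hypothetical, not from any real codebase), a deterministic check that flags any utility class outside the codebase's approved set:

```python
import re

# Hypothetical allowlist: the only spacing/utility classes this codebase permits.
ALLOWED_CLASSES = {"gap-2", "gap-4", "text-primary", "bg-surface"}

CLASS_ATTR = re.compile(r'class="([^"]*)"')

def check_conventions(markup: str) -> list[str]:
    """Return every class used in the markup that falls outside the allowlist."""
    violations = []
    for match in CLASS_ATTR.finditer(markup):
        for cls in match.group(1).split():
            if cls not in ALLOWED_CLASSES:
                violations.append(cls)
    return violations

# A snippet using a non-approved gap value gets flagged:
print(check_conventions('<div class="gap-2 gap-3 bg-surface">'))  # ['gap-3']
```

You would need thousands of rules at this level of specificity to encode a whole codebase's taste, which is exactly the problem.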
Getting there will take both better models and 3-10 years of work to make using them more foolproof and consistent. Even then, context sizes might need to reach hundreds of thousands of tokens for the average task, with all of those human "hunches", styles, and approaches to a given codebase spelled out.
> evaluating success often required LLM-as-judge
This is just advocating for what's easy, not what's proper, and it will lead to slop long term. Though maybe they'd say that's okay, as long as it works.
Most of the time you want those checks to be more dependable than that, the same way you wouldn't want your linter to be non-deterministic.
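To make the contrast concrete, here is a toy example (the acceptance criteria are made up) of what a deterministic judge looks like: same input, same verdict, every run, unlike asking an LLM whether the output "looks right":

```python
def judge_summary(summary: str, max_words: int = 50) -> bool:
    """Deterministic acceptance check: non-empty, within the word
    budget, and ends with a period. No randomness, no model calls."""
    words = summary.split()
    return 0 < len(words) <= max_words and summary.strip().endswith(".")

print(judge_summary("The build passed."))   # True
print(judge_summary("no terminal period"))  # False
```

Trivially simple, but it never disagrees with itself, which is the property a reward signal needs.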
> Tests can be reward hacked - we needed validation that was less vulnerable to the model cheating
Just use adversarial agents: the one that writes the code doesn't touch the tests, and vice versa. Each has all of the context, but each is told to care about different things.
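One way to enforce that separation mechanically, rather than just prompting for it, is a write-scope check in the orchestrator. This is a hypothetical sketch (the agent names, paths, and the `repo`-as-dict model are illustrative):

```python
# Each agent gets a write scope, enforced before any patch lands,
# so the coder can never "fix" a failing test by editing it.
WRITE_SCOPES = {
    "coder":  lambda path: path.startswith("src/"),
    "tester": lambda path: path.startswith("tests/"),
}

def apply_patch(agent: str, path: str, content: str, repo: dict) -> None:
    if not WRITE_SCOPES[agent](path):
        raise PermissionError(f"{agent} may not modify {path}")
    repo[path] = content  # both agents can still *read* the whole repo

repo = {}
apply_patch("coder", "src/app.py", "def add(a, b): return a + b", repo)
apply_patch("tester", "tests/test_app.py", "assert add(1, 2) == 3", repo)
# apply_patch("coder", "tests/test_app.py", "...")  # raises PermissionError
```

Reward hacking then requires colluding across the boundary, which the harness can make much harder than cheating within one agent's own scope.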
It seems like these people are trying to push the envelope, and it might look like it's working in some respects, but they're taking big bets on what's currently feasible versus what isn't.
> If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
Would be nice not to be broke, though.