Comment by dakshgupta
11 hours ago
> Independence
It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways. Question: Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
> Autonomy
Plenty of tools have invested heavily in AI-assisted review, creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, so our system is designed to make all human intervention optional. This is possibly an unpopular opinion, and we respect the camp that says people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.
> Loops
You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin, available via the `/plugins` command, that lets Claude Code run a plan, write, commit, get review comments, plan, write loop.
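For readers who want the shape of that loop, here is a minimal sketch. Every function in it is a hypothetical placeholder standing in for a step of the workflow, not a real Claude Code plugin API:

```python
# Sketch of the plan -> write -> commit -> review -> plan loop described above.
# All helpers are placeholder stubs; only the control flow is the point.

def make_plan(task: str, feedback: list[str] | None = None) -> str:
    suffix = f" addressing {len(feedback)} review comments" if feedback else ""
    return f"plan for {task!r}{suffix}"

def write_and_commit(plan: str) -> str:
    return f"commit implementing {plan!r}"

def get_review_comments(commit: str) -> list[str]:
    return []  # placeholder: a real reviewer would return actionable comments

def review_loop(task: str, max_rounds: int = 5) -> None:
    feedback: list[str] | None = None
    for _ in range(max_rounds):
        plan = make_plan(task, feedback)
        commit = write_and_commit(plan)
        feedback = get_review_comments(commit)
        if not feedback:
            break  # no remaining review comments, so the loop ends
```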
Independence is ridiculous: the underlying LLMs are too similar in their training data and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but all of them have read Stack Overflow, Reddit, and GitHub in training.
It might be an interesting time to double down on automatically building and checking deterministic models of code that were previously too much of a pain to bother with, e.g. adding type checking to lazy Python code. These kinds of checks really are model-independent, and using agents to build and manage them might bring a lot of value.
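To make that concrete, here is a small hypothetical sketch (the `Invoice`/`apply_discount` names are invented for illustration): an agent adds type hints to previously untyped code, and an ordinary checker such as mypy then enforces them with no model in the loop.

```python
# Hypothetical example of a deterministic, model-independent check:
# type annotations added to previously untyped Python, verified by mypy.

from dataclasses import dataclass

@dataclass
class Invoice:
    total_cents: int
    currency: str

def apply_discount(invoice: Invoice, percent: float) -> Invoice:
    """Return a new invoice with a percentage discount applied."""
    discounted = round(invoice.total_cents * (1 - percent / 100))
    return Invoice(total_cents=discounted, currency=invoice.currency)

# Whichever model (or human) wrote the caller, mypy flags this the same way:
#     apply_discount(Invoice(1000, "USD"), "10%")  # error: arg 2 has type "str"
```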
> Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
You're assuming models/prompts insist that a previous iteration of their work was right. They don't. Models try to follow instructions, so if you ask them to find issues, they will. 'Trust' is a human problem, not a model/harness problem.
> Our view is that code validation will be completely autonomous in the medium term.
If reviews are going to be autonomous, they'd be part of the coding agent. Nobody would see it as the independent activity you mentioned above.
> Our first step towards making this easier is a native Claude Code plugin.
Claude can review code based on a specific set of instructions/context in an MD file. An additional plugin is unnecessary.
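For instance, such an instructions file can be as simple as this hypothetical sketch (the checklist items are illustrative, not a recommendation):

```markdown
<!-- REVIEW.md (hypothetical): review instructions the agent is pointed at -->
## Review checklist
- Flag functions over ~50 lines or with more than three levels of nesting.
- Check that new public functions have type hints and docstrings.
- Reject bare `except:` blocks and silently swallowed errors.
- Confirm new behavior is covered by a test under `tests/`.
```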
My view is that to operate in this space, you gotta build a coding agent or get acquired by one. The writing was on the wall a year ago.
> It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways.
Is there empirical evidence for that? Where does it sit on an epistemic meter between (1) “it sounds good when I say it” and (10) “someone ran an evaluation and got significant support”?
“Vibes” (2 or 3 on that scale) are OK, just honestly curious.