
Comment by tharkun__

1 day ago

No sawdust is bad. But it's also bad if you cut all your boards into sawdust. Completely. Obliterated. No useful output, only sawdust.

% of AI suggestions accepted vs. edited is also a BS metric that Anthropic et al. like to push, similar to LoC, because they're large numbers and large numbers must be good, right?

Well guess what, I have auto-accept on and then adjust after it's "done". And I do it by telling it what changes to make and those have auto-accept on as well. That's quite a high "accept" rate, by definition. But in reality it may have churned on 50% of the lines it generated and auto-accepted first.
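To put rough numbers on that scenario, here is a minimal sketch comparing a naive accept rate with the share of generated lines that actually survive to the commit. The `Suggestion` record, its fields, and the session figures are all hypothetical, made up for illustration; no vendor is known to compute it this way.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    lines_generated: int  # lines the agent wrote in this suggestion (hypothetical field)
    accepted: bool        # True when auto-accept or the user applied it
    lines_surviving: int  # lines still unchanged at commit time


def naive_accept_rate(suggestions):
    """Fraction of suggestions that were accepted, ignoring later rework."""
    return sum(s.accepted for s in suggestions) / len(suggestions)


def survival_rate(suggestions):
    """Fraction of accepted lines that were not rewritten afterwards."""
    accepted = [s for s in suggestions if s.accepted]
    generated = sum(s.lines_generated for s in accepted)
    surviving = sum(s.lines_surviving for s in accepted)
    return surviving / generated if generated else 0.0


# Made-up session: everything auto-accepted, half of it churned later.
session = [
    Suggestion(lines_generated=100, accepted=True, lines_surviving=50),
    Suggestion(lines_generated=40, accepted=True, lines_surviving=20),
    Suggestion(lines_generated=60, accepted=True, lines_surviving=30),
]

print(naive_accept_rate(session))  # 1.0
print(survival_rate(session))      # 0.5
```

With auto-accept on, the first number is pinned at 100% no matter what, while the second reflects the churn.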

> % of AI suggestions accepted vs. edited is also a BS metric

I disagree. It’s a valuable metric if you are building an agent / skill infra layer.

Think of it like error rate on your API. Green metric does not mean your system is healthy, but if it’s red you have an issue you definitely need to fix.

Your example scenario is detectable in the non-naive implementation anyway; the o11y layer (usually OTel these days) tracks the trajectories, links them to the diff, and attributes each hunk as coming from the session or not.
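A rough sketch of the attribution step, assuming you already have the text the session produced and the text that was finally committed. It uses Python's standard `difflib` for line matching as a stand-in; a real pipeline would work from trace data linked to diff hunks, and the function name `attribute_lines` and the sample inputs are hypothetical.

```python
import difflib


def attribute_lines(agent_lines, committed_lines):
    """Label each committed line 'agent' if it survives unchanged
    from the agent's output, otherwise 'human'."""
    matcher = difflib.SequenceMatcher(
        a=agent_lines, b=committed_lines, autojunk=False
    )
    labels = ["human"] * len(committed_lines)
    for block in matcher.get_matching_blocks():
        # Each block is a run of identical lines present in both versions.
        for j in range(block.b, block.b + block.size):
            labels[j] = "agent"
    return labels


# Made-up example: the human fixed one line after the session ended.
agent_output = ["def add(a, b):", "    return a - b", "", "print(add(1, 2))"]
committed = ["def add(a, b):", "    return a + b", "", "print(add(1, 2))"]

labels = attribute_lines(agent_output, committed)
agent_share = labels.count("agent") / len(labels)
print(labels)       # ['agent', 'human', 'agent', 'agent']
print(agent_share)  # 0.75
```

Line-level matching like this is crude (it cannot tell a human retyping an identical line from an untouched one), which is why linking hunks to recorded session trajectories gives a stronger signal.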

  • Not the one down-voting you btw. Disagreeing is fine by me.

    I would ask you tho: What incentive do AI vendors have to even try and detect this? It's in their interest to use the most naive interpretation, i.e. what my original comment mentioned, as it shows how "good" their models are, coz nobody ever changes much if anything ;)

    Never mind that they really can't unless they're going "creepy mode". If I use Claude/Codex et al. to agentically write something, then let the session just sit while I go about in my IDE changing things, and then I commit and push, are you telling me that the vendors do or should track all of the changes made to the files they touched and report back to base what got overridden by me, the human?

    • I agree that providers are in some sense incentivized to juice the numbers here. But, they are in an incredibly competitive 3-way knife fight, and so they are also heavily incentivized to be honest with themselves about quality gaps.

      I think I better understand your point now. I was mostly arguing for this as an internal metric inside the model user’s company; I agree it’s less useful coming directly from Anthropic’s measurements.

      What I meant by “agent / skill infra layer” is if you’re a big company and trying to write skills that are widely shared, build common tooling for thousands of engineers to use agents within a big repo, etc.

      RE “creepy”, I dunno, this case doesn’t bother me, but I can see why it might. It’s definitely being done though.