Comment by iepathos

1 month ago

The "code witness" concept falls apart under scrutiny. In practice, the agent isn't replacing ripgrep with pure Python, it's generating a Python wrapper that calls ripgrep via subprocess. So you get:

- Extra tokens to generate the wrapper

- New failure modes (encoding issues, exit code handling, stderr bugs)

- The same underlying tool call anyway

- No stronger guarantees - actually weaker ones, since you're now trusting both the tool AND the generated wrapper
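To make the failure modes in the list above concrete, here is a minimal sketch of what such a generated wrapper tends to look like. The function names `interpret_rg_result` and `search` are hypothetical, not anything from the original post. The only ripgrep behavior relied on is its documented exit codes (0 for matches, 1 for no matches, 2 for an error) and the `--line-number` flag.

```python
import subprocess
from typing import List


def interpret_rg_result(returncode: int, stdout: bytes, stderr: bytes) -> List[str]:
    """Turn a finished rg process into a list of matching lines (hypothetical helper)."""
    if returncode == 0:
        # errors="replace" keeps non-UTF-8 file contents from raising here;
        # a naive text=True wrapper can fail with UnicodeDecodeError instead.
        return stdout.decode("utf-8", errors="replace").splitlines()
    if returncode == 1:
        # rg uses exit code 1 for "no matches", which is not an error.
        # A wrapper using check=True would wrongly treat this as a failure.
        return []
    # Any other exit code is a real failure; surface stderr instead of hiding it.
    message = stderr.decode("utf-8", errors="replace").strip()
    raise RuntimeError(f"rg failed with exit code {returncode}: {message}")


def search(pattern: str, path: str = ".") -> List[str]:
    """Run ripgrep and return matching lines. Assumes rg is on PATH."""
    proc = subprocess.run(
        ["rg", "--line-number", "--", pattern, path],
        capture_output=True,
        check=False,  # exit codes are interpreted by hand above
    )
    return interpret_rg_result(proc.returncode, proc.stdout, proc.stderr)
```

Every branch here is something the wrapper has to get right on top of whatever rg already does, and the underlying call is still just `rg`.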

The theoretical framing about "proofs as programs" and "semantic guarantees" sounds impressive, but the generated wrapper doesn't provide stronger semantics than rg alone, it actually provides strictly weaker ones. The same is true for pretty much any battle-tested CLI tool that you have the AI wrap in Python code instead of calling directly.

For actual development work, the artifact that matters is the code you're building, which we're already tracking in source control. Nobody needs a "witness" of how the agent found the right file to edit, and if they do, agents have parseable logs. Direct tool calls are faster and more reliable, and the intermediate exploration steps are ephemeral scaffolding anyway.

> In practice, the agent isn't replacing ripgrep with pure Python, it's generating a Python wrapper that calls ripgrep via subprocess.

Yep. I have very strong guardrails on what commands agents can execute, but I also have a "vterm" MCP server that the agent uses to test the TUI I'm developing in a real terminal emulator; it can send events, take screenshots, etc.

More than once it's worked around bash tool limitations by using the vterm MCP server to exit the TUI app under development and start issuing unrestricted bash commands. I'm probably going to add command filtering on what can be run under vterm (so it can't exit back to an initial shell), which will help unless/until I add a "!<script>" style command to my TUI, in which case I'm sure it'll find and exploit that instead.
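A rough sketch of the kind of command filtering described above might look like the following. Everything here is an assumption for illustration: the allowlist contents, the `is_allowed` name, and the choice to reject shell metacharacters outright are not taken from the actual vterm MCP server.

```python
import shlex

# Hypothetical allowlist of programs the agent may start in the terminal.
ALLOWED_PROGRAMS = {"cargo", "ls", "cat"}

# Characters that let one command line chain or redirect into another.
SHELL_METACHARACTERS = set(";|&`$><\n")


def is_allowed(command: str) -> bool:
    """Return True only for a single, plain invocation of an allowlisted program."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected.
        return False
    return bool(argv) and argv[0] in ALLOWED_PROGRAMS
```

A filter like this only checks the command line it is given. It does nothing about an allowed program that can itself launch a shell, which is exactly the kind of escape hatch described above.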

> but the generated wrapper doesn't provide stronger semantics than rg alone, it actually provides strictly weaker ones

I don't know if I agree with this.

I had been doing some experiments using PowerShell as the only available tool, and I found that switching to an ExecuteFunction (C#) tool provided a much less buggy experience, even when Process.Start is involved.

Which one is functionally a superset of the other is actually kind of a chicken-and-egg problem, because each can bootstrap into the other. However, in practice the code tool seems to provide far more "paths" and intermediate tokens to absorb the complexity of the original ask. PowerShell seemed much more constraining at the edges. I had a lot of trouble getting the shell to accept verbatim strings as file contents. csc.exe has zero issues with this by comparison.

The trick here is to make the wrappers permanent. Give the agent an environment (VM, whatever) where all of these utilities are stored after being generated.

Basically you let the agent create its own tools and reuse them instead of rewriting them every time from scratch.
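As a minimal sketch of that idea, assuming the generated tools are plain Python scripts kept in one directory: the `ToolStore` class and its `get_or_create` method are hypothetical names, and `generate` stands in for whatever call asks the agent to write the tool.

```python
from pathlib import Path
from typing import Callable


class ToolStore:
    """Keeps generated tool scripts on disk so they are written once and reused."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def get_or_create(self, name: str, generate: Callable[[], str]) -> Path:
        """Return the path of the stored tool, generating it only if it is missing."""
        if not name.isidentifier():
            # Rejects names like "../x" that would escape the store directory.
            raise ValueError(f"invalid tool name: {name!r}")
        path = self.root / f"{name}.py"
        if not path.exists():
            # generate() is only called on a cache miss.
            path.write_text(generate(), encoding="utf-8")
        return path
```

The second request for the same tool name returns the existing file without calling `generate` again, which is where the token savings would come from.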