Comment by ekidd
15 days ago
Where this result is actually interesting and relevant is when a coding agent splits a large source file into multiple smaller files. Opus + Claude Code will try to recite long sections of source code from memory into each of the new files, instead of using some sort of copy/paste operation like a human would.
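For illustration, here is a minimal sketch of the kind of copy/paste operation a harness could expose so the model never has to retype the code. The name `move_lines` and its parameters are hypothetical; this is not a tool that Claude Code or any other harness is known to ship.

```python
def move_lines(src: str, dst: str, start: int, end: int) -> int:
    """Move lines start..end (1-indexed, inclusive) from src to the end of dst.

    Hypothetical helper: the text is copied verbatim instead of being
    regenerated by the model. Returns the number of lines moved.
    """
    # newline="" turns off newline translation, so line endings are preserved.
    with open(src, "r", encoding="utf-8", newline="") as f:
        lines = f.readlines()
    if not (1 <= start <= end <= len(lines)):
        raise ValueError(f"range {start}-{end} is outside 1-{len(lines)}")
    chunk = lines[start - 1:end]
    # Append, so several ranges can be collected into one new file.
    with open(dst, "a", encoding="utf-8", newline="") as out:
        out.writelines(chunk)
    # Rewrite the source file without the moved range.
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.writelines(lines[:start - 1] + lines[end:])
    return len(chunk)
```

With something like this, the agent only has to pick line ranges and then fix imports and compiler errors, and the moved code itself cannot drift.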
Moving a file is a bit easier. LLMs may sometimes try to recite the file from memory. But if you tell them to use "git mv" and fix the compiler errors, they mostly will.
Ordinary editing, on the other hand, generally works fine with any reasonable model and tool setup. Even Qwen3.6 27B is fine at this. And for in-place edits, you can review "git diff" for surprises.
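As a rough sketch of that "review for surprises" step, the script below parses `git diff --numstat` output (tab-separated added count, deleted count, and path per line) and flags files with unusually many deleted lines. The function names and the threshold of 50 are assumptions made for the example, not an established convention.

```python
import subprocess


def parse_numstat(output: str) -> list[tuple[str, int, int]]:
    """Parse `git diff --numstat` output into (path, added, deleted) tuples."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files are reported as "-" for both counts; count them as 0.
        rows.append((
            path,
            0 if added == "-" else int(added),
            0 if deleted == "-" else int(deleted),
        ))
    return rows


def surprising_files(rows, max_deleted: int = 50) -> list[str]:
    """Return paths whose deleted-line count exceeds the (assumed) threshold."""
    return [path for path, _added, deleted in rows if deleted > max_deleted]


def working_tree_numstat() -> str:
    """Run `git diff --numstat` in the current repository and return its output."""
    result = subprocess.run(
        ["git", "diff", "--numstat"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `surprising_files(parse_numstat(working_tree_numstat()))` after an agent run gives a short list of files to open first in the full diff.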
If you’re using LLMs for agentic work, it is absolutely essential that you have a robust set of tools for them to use and the correct instructions to prompt their use.
The LLM will come up with stupid ways to do things; common sense doesn’t exist for AI.
Isn't this the whole reason they became viable in the last 6 months? The system prompt and harness are improving. It's less and less essential every day to roll your own.
I don't think there is a single reason. Models are improving, and so are the harnesses and prompts. Those of us who use them a lot are also getting more proficient and learning where they can be used effectively and where they can't. So there are lots of improvements all over the ecosystem, brought together.
The latest big change is probably how feasible local models are becoming. Models like Qwen 3.6 and Gemma 4 no longer easily get stuck in loops and repetition, although at lower quantizations they still pretty much suck for agentic usage.
The models also have far more intelligence built in. For example, the pi.dev agent harness has a system prompt which fits on a single page, and includes only 4 or 5 tools. Running with a small coding model like Qwen3.6 27B, this setup is completely capable of agentic coding.
They still aren't viable. Nothing changed within the last 6 months.
My favorite is when Claude will build a completely new application to load and inspect a .dll file using reflection instead of just googling the library's interfaces.
It did this during one of the recent outrage periods. It was unjarring deps left and right instead of googling for it. What an easy way for me to own the tokenmaxxing leaderboard, I remember thinking.
“Use all of the tools at your disposal, including searching the internet” is my Claude-specific common instruction.
> And for in-place edits, you can review "git diff" for surprises.
I don't let AI touch git anyway, and I always review the diff after it generates stuff. If it modifies my documentation, I always want to check whether it messed with the text instead of just adding formatting.
This. I know the LLM agents often have their own little diff viewers and edit approval workflows, but for a high volume of code, I cannot imagine actually reviewing everything without leaning on much more capable Git tooling.
I use Magit, and up until I started using LLM agents it was mostly a nice-to-have that I relied on casually. (I was definitely under-utilizing its power.) But for reviewing, selectively staging, and selectively rejecting the changes of an LLM agent? I feel like I'd die without it. Idk how others manage.