Comment by siliconc0w

3 hours ago

I agree that every so often you have to clean up a mess and the illusion breaks. Even with a super detailed spec, even with AGENTS and SKILLs specifying certain patterns or practices, even with 'fresh eyes' reviews from other agents, etc., there is still this long tail of issues where I have to either hand-hold the agent or just manually rework the code. Some examples:

* it cheats at verification. Even with specific instructions how to verify, it still cheats.

* it generates UX (CLI tool) that is absolute garbage and inconsistent, even with specific instructions to minimize unnecessary flags, use convention over configuration, etc.

* it absolutely will not go 'above and beyond' to solve problems - if a task hits a permission or dependency barrier, it'll likely cheat or handwave the problem away. (gpt 5.5 xhigh)

There is maybe this hope/hubris that we can figure out just the right incantations or agent workflows to eliminate these issues - I was optimistic about this too, but after trying for a while and seeing these issues not only persist but in some cases get worse with newer models, I am less sure.

> it cheats at verification. Even with specific instructions how to verify, it still cheats.

As I responded to another commenter, as a prediction engine, the LLM is trying to predict what you want. It, at one level, correctly predicts that you want tests to pass.

Maybe try telling the LLM that you're a verification engineer, and you get bonuses for finding bugs?

Think about it. All those security researchers wouldn't be finding real bugs in real programs using LLMs if this were an insurmountable problem.