Comment by tiny-automates

20 days ago

i've been building agent tooling for a while and this is the question i keep coming back to. the actual failure mode isn't messy code, agents produce reasonably clean, well-typed output these days. it's that the code confidently solves a different problem than what you intended. i've had an agent refactor an auth flow that passed every test but silently dropped a token refresh check because it "simplified" the logic. clean code, good types, tests green, security hole. so for me "quality" has shifted from cyclomatic complexity and readability scores to "does the output behaviour match the specification across edge cases, including the ones i didn't enumerate." that's fundamentally an evaluation problem, not a linting problem.

0 comments

tiny-automates

No comments yet

Contribute on Hacker News ↗