
Comment by throwup238

12 hours ago

> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?

That’s what I’ve been focused on for the last few weeks with my own agent orchestrator. The actual orchestration bit was the easy part; the key is making it self-improving via “workflow reviewer” agents that can spawn new reviewers specializing in catching a specific set of antipatterns, like swallowed errors. Unfortunately, I’ve found that what counts as acceptable code quality depends heavily on the project, the organization, and even the module (tests vs internal utilities vs production services), so prompt instructions like “don’t swallow errors or use unwrap” make one part of the codebase better while another gets worse, creating a conflict for the LLM. A rough sketch of how I scope those rules is below.
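
As a minimal sketch of the scoping idea (not my actual orchestrator; all names and globs here are hypothetical), each reviewer rule only contributes to the reviewer agent's prompt when the changed file falls inside its scope, so "no unwrap" can apply to production services but not tests:

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class ReviewerRule:
    name: str                 # e.g. "no-unwrap"
    prompt: str               # instruction injected into the reviewer agent's prompt
    include: list[str]        # path globs where the rule applies
    exclude: list[str] = field(default_factory=list)

    def applies_to(self, path: str) -> bool:
        if any(fnmatch(path, pat) for pat in self.exclude):
            return False
        return any(fnmatch(path, pat) for pat in self.include)


RULES = [
    ReviewerRule(
        name="no-unwrap",
        prompt="Flag .unwrap()/.expect() outside tests; prefer propagating errors.",
        include=["services/**/*.rs"],
        exclude=["**/tests/**"],
    ),
    ReviewerRule(
        name="no-swallowed-errors",
        prompt="Flag catch/except blocks that discard the error without logging or rethrowing.",
        include=["services/**", "lib/**"],
        exclude=["tools/**", "**/tests/**"],
    ),
]


def reviewer_prompt_for(path: str) -> str:
    """Build the file-specific reviewer instructions for one changed file."""
    active = [r.prompt for r in RULES if r.applies_to(path)]
    return "\n".join(active) or "No file-specific rules; apply general review guidelines."


if __name__ == "__main__":
    print(reviewer_prompt_for("services/auth/src/main.rs"))     # both rules fire
    print(reviewer_prompt_for("services/auth/tests/login.rs"))  # neither fires
```

The "workflow reviewer" layer then edits this rule table instead of a single global prompt, which is what keeps a rule tuned for production services from degrading test or utility code.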

The problem is that model eval was already the hardest part of using LLMs, and evaluating agents is even harder, if not practically impossible. The toy benchmarks the AI companies have been using are laughably inadequate.

So far the best I’ve got is “reimplement MINPACK from scratch against its test suite,” which can take days and has to be evaluated manually.