Comment by ricardobeat
10 hours ago
But how is that testable? If your test is validating the rigidity, water resistance, etc, they will all pass even if the underlying material is a bad choice. Or the glue will degrade in six months.
You can't test if a codebase will be extensible or maintainable as requirements change in the future, if the abstraction level or architecture is sound - that's down to code quality measures like the ones used here. LLMs are very good at slightly cheating to pass tests even when the implementation is wrong. Introducing subjectivity - the kind of input a human will provide - leads to improved output.
https://senior-swe-bench.snorkel.ai/blog/2026-06-16-how-it-w...
That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.