Comment by ricardobeat
9 hours ago
You buy a wooden dinner table, it is fully functional and looks perfect. It’s sturdy. You have dinner on it and it survives a few spills.
A few months later you find out it is made of PU foam and printed waxed paper. A misplaced knee could bring it down. It’s likely to completely fall apart in a year. Is that irrelevant?
Yes, it is relevant and testable. It's exactly what I meant by "a measurable increase in quality of the final product". In fact, a proper test harness would reveal that problem. You are forgetting that with LLMs, testing software does not have to end at the usual unit/integration/e2e level.
But how is that testable? If your tests validate rigidity, water resistance, etc., they will all pass even if the underlying material is a bad choice, or the glue will degrade in six months.
You can't test whether a codebase will be extensible or maintainable as requirements change, or whether the abstraction level and architecture are sound - that's down to code quality measures like the ones used here. LLMs are very good at slightly cheating to pass tests even when the implementation is wrong. Introducing subjectivity - the kind of input a human will provide - leads to improved output.
https://senior-swe-bench.snorkel.ai/blog/2026-06-16-how-it-w...
That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.
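To make that concrete, here is a rough sketch of what I have in mind, not a real benchmark. Every name in it (`Requirement`, `run_episode`, `toy_agent`) is made up for illustration, the "agent" is just a callable you would swap for a real coding agent, and in a real setup the list of requirements would come from the roleplaying LLM instead of being fixed up front. The idea is to feed requirements in one at a time and record, per step, whether the new requirement passes, how many earlier ones fail afterwards, and how much code had to change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical representation of a codebase: file path -> file contents.
Codebase = Dict[str, str]
# Hypothetical agent interface: takes a codebase and a request, returns a new codebase.
Agent = Callable[[Codebase, str], Codebase]


@dataclass
class Requirement:
    description: str
    check: Callable[[Codebase], bool]  # acceptance test for this requirement


@dataclass
class StepResult:
    description: str
    passed: bool        # does the new requirement hold after this step?
    failing_earlier: int  # earlier requirements that do not hold after this step
    lines_changed: int  # crude proxy for how expensive the change was


def count_changed_lines(before: Codebase, after: Codebase) -> int:
    """Very rough churn metric: distinct lines added or removed per file."""
    changed = 0
    for path in set(before) | set(after):
        old = set(before.get(path, "").splitlines())
        new = set(after.get(path, "").splitlines())
        changed += len(old ^ new)
    return changed


def run_episode(
    agent: Agent,
    requirements: List[Requirement],
    codebase: Optional[Codebase] = None,
) -> List[StepResult]:
    """Apply requirements one by one and track how the codebase copes."""
    current: Codebase = dict(codebase or {})
    results: List[StepResult] = []
    for i, req in enumerate(requirements):
        # Give the agent a copy so the "before" snapshot stays intact.
        updated = agent(dict(current), req.description)
        failing_earlier = sum(
            1 for earlier in requirements[:i] if not earlier.check(updated)
        )
        results.append(
            StepResult(
                description=req.description,
                passed=req.check(updated),
                failing_earlier=failing_earlier,
                lines_changed=count_changed_lines(current, updated),
            )
        )
        current = updated
    return results


def toy_agent(codebase: Codebase, request: str) -> Codebase:
    """Stand-in agent that just appends a comment line per request."""
    codebase["app.py"] = codebase.get("app.py", "") + f"# {request}\n"
    return codebase
```

If the architecture is sound, later steps should stay cheap and `failing_earlier` should stay at zero. If it is the foam-and-paper table, churn and breakage should grow as the requirements shift. The line-diff metric here is deliberately naive and would need replacing with something better.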