Comment by Leave_OAI_Alone
5 days ago
You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.
After reviewing all this, what is your actual conclusion, or are you posing it as an open question? Is the takeaway that a comprehensive benchmark exists and we should be using it, or that the problem space is too multifaceted for any single benchmark to be meaningful?
The market - actual customers - is probably the best benchmark for a product.
But then the market doesn't price in outstanding liabilities from poor code quality and technical debt.
There are already code quality metrics.
SAST and DAST tools (static and dynamic application security testing) can score or fix code as part of an LLM-driven development loop.
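To make that loop concrete, here is a rough sketch of what I have in mind. Everything in it is a made-up stand-in: `toy_scan` is a trivial pattern matcher, not a real SAST tool, and `toy_fix` just blanks out flagged lines where a real setup would ask an LLM for a patch. The only point is the shape: scan, hand the findings to the fixer, rescan, stop when clean or out of rounds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    rule: str  # name of the pattern that matched
    line: int  # 1-based line number in the source


# Hypothetical rule set; a real SAST tool has far richer analysis.
RISKY_CALLS = ("eval(", "exec(", "os.system(")


def toy_scan(source: str) -> List[Finding]:
    """Toy stand-in for a SAST scan: flag lines containing risky calls."""
    findings = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        for call in RISKY_CALLS:
            if call in text:
                findings.append(Finding(rule=call.rstrip("("), line=lineno))
    return findings


def toy_fix(source: str, findings: List[Finding]) -> str:
    """Toy stand-in for the LLM step: replace each flagged line with a no-op."""
    lines = source.splitlines()
    for f in findings:
        original = lines[f.line - 1]
        indent = original[: len(original) - len(original.lstrip())]
        lines[f.line - 1] = indent + "pass  # flagged line removed"
    return "\n".join(lines)


def refine(
    source: str,
    scan: Callable[[str], List[Finding]],
    fix: Callable[[str, List[Finding]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Finding]]:
    """Scan, fix, and rescan until no findings remain or rounds run out."""
    for _ in range(max_rounds):
        findings = scan(source)
        if not findings:
            break
        source = fix(source, findings)
    return source, scan(source)


if __name__ == "__main__":
    candidate = "x = input()\nresult = eval(x)\nprint(result)"
    fixed, remaining = refine(candidate, toy_scan, toy_fix)
    print(fixed)
    print("remaining findings:", len(remaining))
```

The number of remaining findings after the loop is the "score" part; whether the fixed code still does what the product needs is a separate question the scanner can't answer.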
Formal verification is arguably the strongest code quality signal.
Is there anything to measure beyond product-market fit and infosec liabilities?