Comment by westurner

6 days ago

Is there an established benchmark for building a full product?

- SWE-bench leaderboard: https://github.com/FoundationAgents/MetaGPT (a usage sketch follows this list) :

> Software Company as Multi-Agent System

> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.

- Mutation-Guided LLM-based Test Generation: https://github.com/codefuse-ai/Awesome-Code-LLM :

> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding

- underlines/awesome-ml/tools.md > Benchmarking: https://arxiv.org/abs/2402.00350
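
For concreteness, the MetaGPT item above boils down to a one-line-requirement-in, generated-repo-out call. The snippet below is a hedged sketch adapted from MetaGPT's README quickstart; the exact import path (`metagpt.software_company.generate_repo`) and config file location may differ across versions, and an LLM API key must be configured first.

```python
# Sketch adapted from MetaGPT's README quickstart; the API may vary by version.
# Assumes an LLM API key is already configured (e.g. in ~/.metagpt/config2.yaml).
from metagpt.software_company import generate_repo

# One-line requirement in; PRD, design docs, and code out.
repo = generate_repo("Create a 2048 game")
print(repo)  # prints the generated project structure
```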

You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.

After reviewing all this, what is your actual conclusion, or are you genuinely asking? Is the takeaway that a comprehensive benchmark exists and we should be using it, or that the problem space is too multifaceted for any single benchmark to be meaningful?

  • The market (actual customers) is probably the best benchmark for a product.

    But then outstanding liabilities from poor code quality and technical debt aren't priced in by the market.

    There are already established code quality metrics (e.g. cyclomatic complexity, maintainability index, test coverage).

    SAST and DAST tools can score or fix code as part of an LLM-driven development loop (a minimal sketch of such a loop follows this list).

    Formal verification is perhaps the strongest code quality metric.

    Is there more to measure than product-market fit and infosec liabilities?
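
For the SAST-scored loop mentioned above, here is a minimal sketch. Assumptions not in the original comment: Bandit is installed and its JSON report exposes `results[].filename` / `line_number` / `issue_text`; `llm_propose_fix` is a hypothetical stand-in for whatever LLM API you use; the count of remaining findings serves as the "score".

```python
import json
import subprocess
from pathlib import Path


def sast_findings(target: str) -> list[dict]:
    """Run Bandit (a Python SAST tool) over `target`; return its findings."""
    proc = subprocess.run(
        ["bandit", "-r", target, "-f", "json"],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout or "{}")
    return report.get("results", [])


def llm_propose_fix(source: str, finding: dict) -> str:
    """Hypothetical LLM call: return a patched version of `source`."""
    raise NotImplementedError("wire up your LLM provider here")


def fix_loop(target: str, max_rounds: int = 3) -> int:
    """Scan, ask the LLM to patch flagged files, re-scan; return the score
    as the number of findings still open (lower is better)."""
    for _ in range(max_rounds):
        findings = sast_findings(target)
        if not findings:
            break
        for finding in findings:
            path = Path(finding["filename"])
            patched = llm_propose_fix(path.read_text(), finding)
            path.write_text(patched)  # the next round's scan verifies the fix
    return len(sast_findings(target))
```

The same shape works for DAST (run the scanner against a deployed instance instead of the source tree) or for a formal-verification gate, with the remaining-findings count standing in for a code quality score.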