Comment by mitjam

6 days ago

Really love this.

Would love to see an actual end-to-end example video of you creating, planning, and implementing a task using your preferred models and apps.

Will definitely do. I am also planning to run a benchmark with various models to see which is most effective at building a full product, starting from a PRD and using backlog to manage the tasks.
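A minimal sketch of how different models could be swapped into such a comparison, assuming OpenRouter's OpenAI-compatible endpoint and the `openai` client; the model IDs, prompt, and PRD here are illustrative placeholders, not anything backlog ships with:

```ts
// Rough sketch, not backlog's implementation: route the same planning prompt
// to different models through OpenRouter's OpenAI-compatible API.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

// Ask one model to break a PRD into tasks; the prompt is a placeholder.
async function draftTasks(prd: string, model: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model, // e.g. "anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini"
    messages: [
      { role: "system", content: "Break this PRD into small, ordered tasks." },
      { role: "user", content: prd },
    ],
  });
  return completion.choices[0]?.message?.content ?? "";
}

// Compare how different models plan the same product.
async function main() {
  const prd = "A CLI habit tracker with streaks and reminders.";
  for (const model of ["anthropic/claude-3.5-sonnet", "deepseek/deepseek-chat"]) {
    console.log(`\n=== ${model} ===\n${await draftTasks(prd, model)}`);
  }
}

main().catch(console.error);
```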

  • I'd love to see OpenRouter connectivity to try non-Claude models for some of the planning parts of the cycle.

  • Is there an established benchmark for building a full product?

    - SWE-bench leaderboard: https://github.com/FoundationAgents/MetaGPT:

    > Software Company as Multi-Agent System

    > MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.

    - Mutation-Guided LLM-based Test Generation: https://github.com/codefuse-ai/Awesome-Code-LLM:

    > 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding

    - underlines/awesome-ml/tools.md > Benchmarking: https://arxiv.org/abs/2402.00350

    • You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.

      After reviewing all this, what is your actual conclusion, or are you still asking the question yourself? Is the takeaway that a comprehensive benchmark already exists and we should be using it, or that the problem space is too multifaceted for any single benchmark to be meaningful?