Comment by mitjam
6 days ago
Really love this.
Would love to see an actual end-to-end example video of you creating, planning, and implementing a task using your preferred models and apps.
Will definitely do that. I am also planning to run a benchmark across various models to see which is most effective at building a full product, starting from a PRD and using backlog for managing tasks.
I'd love to see OpenRouter connectivity so non-Claude models can be tried for some of the planning parts of the cycle.
Is there an established benchmark for building a full product?
- FoundationAgents/MetaGPT: https://github.com/FoundationAgents/MetaGPT :
> Software Company as Multi-Agent System
> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.
- codefuse-ai/Awesome-Code-LLM: https://github.com/codefuse-ai/Awesome-Code-LLM :
> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding
- underlines/awesome-ml/tools.md > Benchmarking: https://arxiv.org/abs/2402.00350
You have compiled an interesting list of benchmarks and adjacent research, and the implicit question is whether an established benchmark for building a full product exists.
After reviewing all of this, what is your actual conclusion, or is this still an open question for you? Is the takeaway that a comprehensive benchmark exists and we should be using it, or that the problem space is too multifaceted for any single benchmark to be meaningful?