Comment by mitjam
6 days ago
Really love this.
Would love to see an actual end-to-end example video of you creating, planning, and implementing a task using your preferred models and apps.
Will definitely do that. I am also planning to run a benchmark across various models to see which is most effective at building a full product, starting from a PRD and using backlog for managing tasks.
I'd love to see OpenRouter connectivity so non-Claude models can be tried for some of the planning parts of the cycle.
Is there an established benchmark for building a full product?
- FoundationAgents/MetaGPT: https://github.com/FoundationAgents/MetaGPT :
> Software Company as Multi-Agent System
> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.
- codefuse-ai/Awesome-Code-LLM: https://github.com/codefuse-ai/Awesome-Code-LLM :
> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding
- underlines/awesome-ml/tools.md > Benchmarking: https://arxiv.org/abs/2402.00350
You have compiled an interesting list of benchmarks and adjacent research, and the implicit question is whether an established benchmark for building a full product exists.
After reviewing all of this, what is your actual conclusion, or is this still an open question for you? Is the takeaway that a comprehensive benchmark exists and we should be using it, or that the problem space is too multifaceted for any single benchmark to be meaningful?