Comment by eli

9 hours ago

Obviously there are advantages to not having to do work yourself.

But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.

I haven’t gotten very far yet, but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks (bug fixes or features) that were already implemented by hand, so you can compare how different LLMs would do them.
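
To make the idea concrete, here is a rough sketch of the history-walking part, not a working tool. Everything named here (`Commit`, `parse_log`, `read_history`, `build_prompt`, the prompt wording) is made up for illustration. It assumes each non-merge commit is one self-contained task and that the commit message is a good enough task description, which in a real repo would often need hand editing.

```python
import subprocess
from dataclasses import dataclass

# Hypothetical separators: git expands %x1f / %x1e to these control bytes,
# which are very unlikely to appear inside a commit message.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
LOG_FORMAT = "%H%x1f%P%x1f%s%x1f%b%x1e"  # sha, parents, subject, body


@dataclass
class Commit:
    sha: str      # the hand-written solution
    parent: str   # the state a model would start from
    subject: str
    body: str


def parse_log(raw: str) -> list[Commit]:
    """Turn `git log --format=LOG_FORMAT` output into single-parent commits."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # Strip only newlines: Python treats \x1f as whitespace, so a bare
        # strip() would eat the separator in front of an empty body.
        record = record.strip("\n")
        if not record:
            continue
        sha, parents, subject, body = record.split(FIELD_SEP, 3)
        parent_list = parents.split()
        if len(parent_list) != 1:
            continue  # skip root commits and merges
        commits.append(Commit(sha, parent_list[0], subject, body.strip()))
    return commits


def read_history(repo: str, limit: int = 20) -> list[Commit]:
    """Read the most recent commits from a local repository."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-n", str(limit), f"--format={LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_log(out)


def build_prompt(commit: Commit) -> str:
    """Draft a task prompt for a human to edit before sending it to a model."""
    task = commit.subject
    if commit.body:
        task += "\n\n" + commit.body
    return (
        f"Check out the repository at commit {commit.parent}.\n"
        f"Task: {task}\n"
        "Implement the change and return a unified diff."
    )
```

You would call `read_history(".")`, tweak each draft from `build_prompt`, run it against a few models, and then read each model's diff next to `git diff <parent> <sha>` yourself. The judging stays with the human, which is the whole point.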