Comment by purple-leafy
16 hours ago
Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.
What you really need is an objective benchmark
I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
The issue is that you can't do unsupervised learning if you require humans.
Obviously there are advantages to not having to do work yourself.
But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.
I haven’t gotten very far yet, but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks (bug fixes or features) that were already implemented by hand, so you can compare how different LLMs would do them.
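A rough sketch of what the git-history walk could look like, in case it helps make the idea concrete. Everything here is an assumption on my part: the function names, the keyword filter, and the prompt wording are all hypothetical, and a real tool would need to look at the diffs too, not just commit subjects.

```python
import subprocess

FIELD_SEP = "\x1f"   # separates commit hash from subject
RECORD_SEP = "\x1e"  # terminates each commit record


def read_git_log(repo_path="."):
    """Return raw `git log` output using the separators above."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H%x1f%s%x1e"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(raw):
    """Split raw log text into (commit_hash, subject) pairs."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip()
        if not record:
            continue
        commit_hash, _, subject = record.partition(FIELD_SEP)
        commits.append((commit_hash, subject))
    return commits


def candidate_tasks(commits, keywords=("fix", "add", "implement")):
    """Keep commits whose subject starts with a task-like keyword.

    The keyword list is a crude, hypothetical filter; tune it to your repo.
    """
    return [(h, s) for h, s in commits if s.lower().startswith(keywords)]


def draft_prompt(commit_hash, subject):
    """Turn a commit subject into a starting prompt to edit by hand."""
    return (
        f"Starting from the parent of commit {commit_hash[:8]}, "
        f"make this change: {subject}"
    )
```

You would then run each drafted prompt against the models you care about and compare their output to what you actually committed, with you doing the judging.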
LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).
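For the refusal and loop cases, a cheap pre-filter before any grading might look something like this. It is only a heuristic sketch: the marker phrases, the n-gram size, and the repeat threshold are values I made up for illustration, not anything tuned.

```python
REFUSAL_MARKERS = (  # hypothetical phrase list; extend for your models
    "i can't help with",
    "i cannot help with",
    "i'm unable to",
    "as an ai",
)


def looks_like_refusal(response):
    """True if the response contains any known refusal phrase."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def looks_like_loop(response, ngram_size=5, max_repeats=3):
    """True if any word n-gram occurs more than max_repeats times."""
    words = response.lower().split()
    counts = {}
    for i in range(len(words) - ngram_size + 1):
        ngram = tuple(words[i:i + ngram_size])
        counts[ngram] = counts.get(ngram, 0) + 1
        if counts[ngram] > max_repeats:
            return True
    return False


def triage(response):
    """Label a response before it reaches the grader."""
    if not response.strip():
        return "empty"
    if looks_like_refusal(response):
        return "refusal"
    if looks_like_loop(response):
        return "loop"
    return "gradeable"
```

Anything not labeled "gradeable" can be scored as a failure, or set aside for a human to look at, without asking another model to guess.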
I'm experimenting with using traditional NLP (stanza, spaCy, etc.) to grade the responses according to different metrics (is the response in first, second, or third person? Is it written as poetry, prose, or drama? etc.). I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
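To give a sense of the first/second/third person check, here is a toy version that only counts pronouns with the standard library. To be clear, this is a stand-in and not what stanza or spaCy would do; a real pipeline would use their morphological features, and the word lists below are my own assumption.

```python
import re
from collections import Counter

# Hypothetical, incomplete pronoun lists for illustration only.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "she", "her", "hers", "it", "its",
              "they", "them", "their", "theirs"},
}


def pronoun_counts(text):
    """Count pronouns per grammatical person in the text."""
    counts = Counter()
    # Splitting on non-letters means "I'm" yields the token "i".
    for token in re.findall(r"[a-z]+", text.lower()):
        for person, words in PRONOUNS.items():
            if token in words:
                counts[person] += 1
    return counts


def dominant_person(text):
    """Return the most frequent person, or None if no pronouns were found."""
    counts = pronoun_counts(text)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Calling `dominant_person("You should check your settings.")` gives `"second"`. Quoted dialogue and impersonal "it" will throw this off, which is part of why a proper parser is worth the trouble.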
> What you really need is an objective benchmark
"When are all the software engineers unemployed?"
Not sure I follow haha