Comment by purple-leafy

16 hours ago

Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.

What you really need is an objective benchmark.

7 comments


eli 16 hours ago

I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.

charcircuit 15 hours ago
The issue is that you can't do unsupervised learning if you require humans.
- eli 7 hours ago
  
  Obviously there are advantages to not having to do work yourself.
  But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.
  I haven’t gotten very far yet, but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks (bugs or features) that you already implemented by hand, so you can compare how different LLMs would do them.
- rhdunn 13 hours ago
  
  Having LLMs grade the answers relies on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).
  I'm investigating/experimenting with using traditional NLP (stanza, spaCy, etc.) to try and grade the responses according to different metrics (is the response in first/second/third person?, is it written as poetry, prose, or drama? etc.). I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
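
To make the "first/second/third person" metric above concrete, here is a minimal sketch that uses only pronoun counting from the standard library. It is not rhdunn's implementation; the pronoun lists and the names `grammatical_person` and `grade` are assumptions for illustration, and a real grader would more likely use POS tags and morphology from spaCy or stanza.

```python
import re
from collections import Counter

# Hypothetical pronoun lists (an assumption for this sketch); a real grader
# would more likely rely on an NLP pipeline's morphological features.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself",
              "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "himself", "she", "her", "hers", "herself",
              "it", "its", "itself", "they", "them", "their", "theirs",
              "themselves"},
}


def grammatical_person(text):
    """Return 'first', 'second', or 'third' by majority pronoun count,
    or None when the text contains no listed pronoun."""
    # Splitting on non-letters turns "I'm" into "i" + "m", so contractions
    # still contribute their pronoun.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for person, words in PRONOUNS.items():
            if token in words:
                counts[person] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def grade(response, expected_person):
    """Pass/fail check: did the model answer in the requested person?"""
    return grammatical_person(response) == expected_person
```

For example, `grade("You should check your settings.", "second")` passes, while a response with no pronouns at all yields `None` and fails any expectation. Counting pronouns is crude (quoted dialogue will skew it), which is part of why a proper NLP pipeline is attractive here.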
  
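
eli's idea of mining git history for benchmark tasks could start from something like the sketch below. It is only a guess at the shape of such a tool: the function names, the prompt wording, and the choice of `%x1f`/`%x1e` as separators are all assumptions, not anything described in the thread.

```python
import subprocess

FIELD_SEP = "\x1f"   # separates hash, subject, and body within one commit
RECORD_SEP = "\x1e"  # terminates each commit record
# %H = full hash, %s = subject, %b = body, %xNN = literal byte
LOG_FORMAT = "%H%x1f%s%x1f%b%x1e"


def read_log(repo_path, limit=20):
    """Return raw `git log` output for the most recent `limit` commits."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--max-count={limit}",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_log(raw):
    """Split raw log output into dicts with sha, subject, and body."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # Strip only newlines: Python treats \x1f as whitespace, so a bare
        # strip() would eat the separator when the body is empty.
        record = record.strip("\n")
        if not record:
            continue
        sha, subject, body = record.split(FIELD_SEP, 2)
        commits.append({"sha": sha, "subject": subject, "body": body.strip()})
    return commits


def build_prompt(commit):
    """Turn one hand-written commit into a task prompt (hypothetical wording)."""
    prompt = f"Implement the following change in this repository:\n\n{commit['subject']}"
    if commit["body"]:
        prompt += f"\n\n{commit['body']}"
    return prompt
```

A human would still pick which commits make good tasks and judge the results, for instance by checking out the parent of each chosen commit, running each model on the prompt, and comparing its diff against the original one by eye.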

echelon 16 hours ago

> What you really need is an objective benchmark

"When are all the software engineers unemployed?"

purple-leafy 16 hours ago

Not sure I follow haha