Comment by pants2
3 months ago
The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.
If you and others have any insights to share on structuring that benchmark, I'm all ears.
There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.
The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.
Generally, the easiest approach:
1. Sample a set of prompts / answers from historical usage.
2. Run that through various frontier models again, and if they don't agree on some answers, hand-pick what you're looking for.
3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set (rough sketch after this list).
4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
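To make step 3 concrete, here's a rough sketch of what that loop can look like against OpenRouter's OpenAI-compatible endpoint. The model IDs, the toy test set, and the exact-match scoring are placeholders, not recommendations; swap in whatever comparison fits your outputs.

    # Rough sketch of step 3: run a sampled prompt/answer set through a few
    # candidate models via OpenRouter (OpenAI-compatible API) and score each
    # on accuracy and latency. Model IDs, test cases, and the exact-match
    # scoring below are placeholders.
    import os
    import time

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    # Steps 1/2: prompts sampled from historical usage with a hand-picked answer.
    TEST_SET = [
        {"prompt": "Summarize: ...", "expected": "..."},
    ]

    CANDIDATE_MODELS = [
        "openai/gpt-4o-mini",          # placeholder model IDs
        "anthropic/claude-3.5-haiku",
    ]

    def score_models():
        results = []
        for model in CANDIDATE_MODELS:
            correct, latency = 0, 0.0
            for case in TEST_SET:
                start = time.time()
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": case["prompt"]}],
                )
                latency += time.time() - start
                answer = resp.choices[0].message.content.strip()
                # Naive exact match; replace with whatever comparison fits
                # your outputs (regex, embeddings, an LLM judge, ...).
                correct += int(answer == case["expected"])
            results.append({
                "model": model,
                "accuracy": correct / len(TEST_SET),
                "avg_latency_s": latency / len(TEST_SET),
            })
        return sorted(results, key=lambda r: r["accuracy"], reverse=True)

    if __name__ == "__main__":
        for row in score_models():
            print(row)

Cost can be folded in the same way: the responses report token usage, so multiply by each model's per-token price and add it as a third column.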
How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?
Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive, make a small 'screening' benchmark of just a few of the hardest problems to see if it's even worth running the full benchmark.
1. https://openrouter.ai/models?order=top-weekly&fmt=table
Thank you! I'll see about building a test suite.
Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be testing diagnostic information summaries: the output is free text, not structured. The only way I can think of to automate that would be with another LLM.
Advice welcome!
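For the "another LLM" idea mentioned above, a minimal LLM-as-judge sketch could look like the following; the judge model ID, rubric wording, and 1-5 scale are all assumptions, not anything from this thread.

    # Minimal LLM-as-judge sketch for free-text summaries: ask a grader model
    # to score a candidate against a reference answer on a 1-5 scale.
    # Judge model, rubric, and scale are placeholders.
    import os

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    JUDGE_MODEL = "openai/gpt-4o"  # placeholder judge model

    RUBRIC = (
        "You are grading a diagnostic summary. Compare the CANDIDATE to the "
        "REFERENCE and rate its factual accuracy and completeness from 1 "
        "(poor) to 5 (equivalent). Reply with the number only."
    )

    def judge(reference: str, candidate: str) -> int:
        resp = client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[
                {"role": "system", "content": RUBRIC},
                {
                    "role": "user",
                    "content": f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}",
                },
            ],
        )
        # Parse defensively: the judge may wrap the score in extra text.
        text = resp.choices[0].message.content
        digits = [c for c in text if c.isdigit()]
        return int(digits[0]) if digits else 0

Averaging judge scores over the test set gives a rough ranking, but it's worth spot-checking a handful by hand, since judge models can favor their own writing style.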