Comment by pants2
3 months ago
The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.
If you and others have any insights to share on structuring that benchmark, I'm all ears.
There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.
The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.
Generally, the easiest approach:
1. Sample a set of prompts / answers from historical usage.
2. Run that through various frontier models again, and if they don't agree on some answers, hand-pick what you're looking for.
3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set (rough sketch after this list).
4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
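To make step 3 concrete, here's a rough sketch of what that loop can look like against OpenRouter's OpenAI-compatible endpoint. The model IDs, the toy test set, and the exact-match scoring are placeholders, not recommendations; swap in whatever comparison fits your outputs.

    # Rough sketch of step 3: run a sampled prompt/answer set through a few
    # candidate models via OpenRouter (OpenAI-compatible API) and score each
    # on accuracy and latency. Model IDs, test cases, and the exact-match
    # scoring below are placeholders.
    import os
    import time

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    # Steps 1/2: prompts sampled from historical usage with a hand-picked answer.
    TEST_SET = [
        {"prompt": "Summarize: ...", "expected": "..."},
    ]

    CANDIDATE_MODELS = [
        "openai/gpt-4o-mini",          # placeholder model IDs
        "anthropic/claude-3.5-haiku",
    ]

    def score_models():
        results = []
        for model in CANDIDATE_MODELS:
            correct, latency = 0, 0.0
            for case in TEST_SET:
                start = time.time()
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": case["prompt"]}],
                )
                latency += time.time() - start
                answer = resp.choices[0].message.content.strip()
                # Naive exact match; replace with whatever comparison fits
                # your outputs (regex, embeddings, an LLM judge, ...).
                correct += int(answer == case["expected"])
            results.append({
                "model": model,
                "accuracy": correct / len(TEST_SET),
                "avg_latency_s": latency / len(TEST_SET),
            })
        return sorted(results, key=lambda r: r["accuracy"], reverse=True)

    if __name__ == "__main__":
        for row in score_models():
            print(row)

Cost can be folded in the same way: the responses report token usage, so multiply by each model's per-token price and add it as a third column.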
How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?
Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive, make a small 'screening' benchmark of just a few of the hardest problems to see if it's even worth running the full benchmark.
1. https://openrouter.ai/models?order=top-weekly&fmt=table
Thank you! I'll see about building a test suite.
Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be testing diagnostic information summaries: the output is free text, not structured. The only way I can think of to automate that would be with another LLM.
Advice welcome!
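For the "another LLM" idea mentioned above, a minimal LLM-as-judge sketch could look like the following; the judge model ID, rubric wording, and 1-5 scale are all assumptions, not anything from this thread.

    # Minimal LLM-as-judge sketch for free-text summaries: ask a grader model
    # to score a candidate against a reference answer on a 1-5 scale.
    # Judge model, rubric, and scale are placeholders.
    import os

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    JUDGE_MODEL = "openai/gpt-4o"  # placeholder judge model

    RUBRIC = (
        "You are grading a diagnostic summary. Compare the CANDIDATE to the "
        "REFERENCE and rate its factual accuracy and completeness from 1 "
        "(poor) to 5 (equivalent). Reply with the number only."
    )

    def judge(reference: str, candidate: str) -> int:
        resp = client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[
                {"role": "system", "content": RUBRIC},
                {
                    "role": "user",
                    "content": f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}",
                },
            ],
        )
        # Parse defensively: the judge may wrap the score in extra text.
        text = resp.choices[0].message.content
        digits = [c for c in text if c.isdigit()]
        return int(digits[0]) if digits else 0

Averaging judge scores over the test set gives a rough ranking, but it's worth spot-checking a handful by hand, since judge models can favor their own writing style.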