Comment by dotancohen

3 months ago

How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?

Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive, make a small 'screening' benchmark out of a few of your hardest problems to see whether a model is even worth the full benchmark (rough sketch below).

1. https://openrouter.ai/models?order=top-weekly&fmt=table
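
A rough sketch of that screening pass, assuming OpenRouter's OpenAI-compatible API and an OPENROUTER_API_KEY environment variable; the prompt list and the cutoff at ~30 models are placeholders for whatever your own setup looks like, not a definitive implementation:

  import os
  import requests

  API_KEY = os.environ["OPENROUTER_API_KEY"]
  BASE = "https://openrouter.ai/api/v1"

  # A few of your hardest problems; keep this list short, it's only a screen.
  SCREENING_PROMPTS = [
      "Summarize the key findings in this report: <hard example here>",
  ]

  def candidate_models(n=30):
      # /models returns the catalog; it isn't ordered by weekly usage, so in
      # practice you may just curate the top-30 list by hand from the website.
      resp = requests.get(f"{BASE}/models", timeout=30)
      resp.raise_for_status()
      return [m["id"] for m in resp.json()["data"]][:n]

  def ask(model, prompt):
      resp = requests.post(
          f"{BASE}/chat/completions",
          headers={"Authorization": f"Bearer {API_KEY}"},
          json={"model": model, "messages": [{"role": "user", "content": prompt}]},
          timeout=120,
      )
      resp.raise_for_status()
      return resp.json()["choices"][0]["message"]["content"]

  for model in candidate_models():
      for prompt in SCREENING_PROMPTS:
          print(f"=== {model} ===")
          print(ask(model, prompt)[:500])  # skim; only full-benchmark the survivors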

  • Thank you! I'll see about building a test suite.

    Do you compare the models' outputs subjectively, by hand? Or do you have some objective measures? My use case would be testing diagnostic information summaries: the output is free text, not structured. The only way I can think of to automate that would be with another LLM.

    Advice welcome!

    • Yeah, things are easy when you can objectively score an output; otherwise, as you said, you'll probably need another LLM to do the scoring. For summaries you can make that somewhat more objective with checks like a length limit and a rubric such as "8/10 key points are covered in this summary" (see the sketch after this comment).

      Scoring outputs this way is used in real training pipelines (for example as the reward signal in Group Relative Policy Optimization), so it's a legitimate approach.
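
      A minimal sketch of that kind of rubric scoring with an LLM judge, assuming an OpenAI-compatible endpoint; the judge model, the key-point list, and the word limit are placeholders you'd fill in from a human-written reference, not the commenter's actual setup:

        import json
        import os
        import requests

        BASE = "https://openrouter.ai/api/v1"  # any OpenAI-compatible endpoint works
        API_KEY = os.environ["OPENROUTER_API_KEY"]
        JUDGE_MODEL = "openai/gpt-4o"  # placeholder judge model

        def score_summary(summary, key_points, max_words=200):
            """Rubric score: fraction of key points covered, plus a hard length check."""
            prompt = (
                "You are grading a summary. For each key point below, answer true if "
                "the summary covers it and false otherwise. Reply with only a JSON "
                "array of booleans, in order.\n\nKey points:\n"
                + "\n".join(f"- {p}" for p in key_points)
                + f"\n\nSummary:\n{summary}"
            )
            resp = requests.post(
                f"{BASE}/chat/completions",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={"model": JUDGE_MODEL,
                      "messages": [{"role": "user", "content": prompt}],
                      "temperature": 0},
                timeout=120,
            )
            resp.raise_for_status()
            # Assumes the judge returns bare JSON; in practice you may need to strip
            # code fences or retry on parse failures.
            covered = json.loads(resp.json()["choices"][0]["message"]["content"])
            return {
                "coverage": sum(covered) / len(key_points),  # e.g. 8/10 points covered
                "within_length": len(summary.split()) <= max_words,
            }

      Called as, say, score_summary(model_output, key_points_from_a_clinician, max_words=150), the coverage number gives you something comparable across models, and forcing the judge down to a bare JSON array of booleans keeps the score cheap to parse and easy to audit against the reference points.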
