Comment by gregsadetsky

3 months ago

I'm new/uninformed in this world, but I have an idea for an eval that I think has not been tried yet.

Can anyone direct me towards how to ... make one? At the most fundamental level, is it about having test questions with known, golden (verified, valid) answers, asking different LLMs to answer them, and comparing scores (how many answers each model got right)?
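For concreteness, here's the loop I'm imagining, sketched with the Vercel AI SDK that gets linked downthread. The model IDs, the two-item dataset, and the exact-match grader are all placeholders, not recommendations:

```ts
// Sketch of the basic eval loop: golden answers, ask each model, score.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const dataset = [
  { question: "What is 2 + 2?", golden: "4" },
  { question: "What is the capital of France?", golden: "Paris" },
];

// Naive grader: exact match after trimming. Real evals usually need
// normalization, regex extraction, or an LLM-as-judge.
const isCorrect = (output: string, golden: string) =>
  output.trim() === golden;

async function runEval(modelId: string): Promise<number> {
  let correct = 0;
  for (const { question, golden } of dataset) {
    const { text } = await generateText({
      model: openai(modelId),
      prompt: question,
      temperature: 0,
    });
    if (isCorrect(text, golden)) correct++;
  }
  return correct / dataset.length; // accuracy in [0, 1]
}

// Compare scores across models (IDs are placeholders).
for (const id of ["gpt-4o-mini", "gpt-4o"]) {
  console.log(id, await runEval(id));
}
```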

What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm? What are non-obvious gotchas?

Finally, are there any known/commonly used frameworks for this? Or would any tooling that can call different LLMs be enough?

Thanks!

> Can anyone direct me towards how to ... make one?

https://hamel.dev/blog/posts/evals/

> What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm?

LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data.
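You can sanity-check that for your own provider in a few lines; a sketch using the AI SDK linked below, where the model ID is just a placeholder:

```ts
// Send the identical prompt twice at temperature 0 and compare outputs.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const prompt = "Name three prime numbers.";
const run = () =>
  generateText({ model: openai("gpt-4o-mini"), prompt, temperature: 0 });

const [a, b] = await Promise.all([run(), run()]);
console.log(a.text === b.text ? "identical outputs" : "outputs differ");
```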

> Finally, any known/commonly used frameworks to do this, or any tooling that can call different LLMs would be enough?

https://github.com/vercel/ai

https://github.com/mattpocock/evalite
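To give a feel for evalite, here is a sketch roughly following the shape of its README example; the option names and the `autoevals` scorer import are from memory, so check the current docs:

```ts
// Minimal evalite eval. Save as something.eval.ts and run with the
// evalite CLI.
import { evalite } from "evalite";
import { Levenshtein } from "autoevals"; // string-similarity scorer

evalite("Capitals", {
  // Each case pairs an input with the expected (golden) answer.
  data: async () => [
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  // The task under test: a stub here, normally your LLM call.
  task: async (input) => {
    return "Paris"; // replace with e.g. a generateText(...) call
  },
  // Scorers compare the task's output against `expected`.
  scorers: [Levenshtein],
});
```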

  • "LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data."

    Is this true? I remember there being a randomization factor in weighing tokens to make the output more something, dont recall what

    Obviously I'm not an Ai dev
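For what it's worth, the randomization being half-remembered here is temperature sampling: the model turns its next-token scores into a probability distribution and samples from it, and the temperature setting controls how flat that distribution is. A toy sketch of the mechanism, not any particular provider's implementation:

```ts
// Toy temperature sampling over next-token logits. Higher temperature
// flattens the distribution (more random output); temperature 0 is
// greedy decoding (always the most likely token).
function sampleToken(logits: number[], temperature: number): number {
  // Temperature 0 (or below) means greedy decoding: take the argmax.
  if (temperature <= 0) {
    return logits.indexOf(Math.max(...logits));
  }
  // Scale logits by 1/temperature, then softmax (subtract max for
  // numerical stability).
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);
  // Draw one index according to the probabilities.
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```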