Comment by gregsadetsky

3 months ago

I'm new/uninformed in this world, but I have an idea for an eval that I think has not been tried yet.

Can anyone direct me towards how to ... make one? At the most fundamental level, is it about having test questions with known, golden (verified, valid) answers, asking different LLMs to answer them, and comparing scores (how many answers each model got right)?
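For concreteness, here's the loop I'm imagining, sketched with the Vercel AI SDK that gets linked downthread. The model IDs, the two-item dataset, and the exact-match grader are all placeholders, not recommendations:

```ts
// Sketch of the basic eval loop: golden answers, ask each model, score.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const dataset = [
  { question: "What is 2 + 2?", golden: "4" },
  { question: "What is the capital of France?", golden: "Paris" },
];

// Naive grader: exact match after trimming. Real evals usually need
// normalization, regex extraction, or an LLM-as-judge.
const isCorrect = (output: string, golden: string) =>
  output.trim() === golden;

async function runEval(modelId: string): Promise<number> {
  let correct = 0;
  for (const { question, golden } of dataset) {
    const { text } = await generateText({
      model: openai(modelId),
      prompt: question,
      temperature: 0,
    });
    if (isCorrect(text, golden)) correct++;
  }
  return correct / dataset.length; // accuracy in [0, 1]
}

// Compare scores across models (IDs are placeholders).
for (const id of ["gpt-4o-mini", "gpt-4o"]) {
  console.log(id, await runEval(id));
}
```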

What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm? What are non-obvious gotchas?

Finally, are there any known/commonly used frameworks for this? Or would any tooling that can call different LLMs be enough?

Thanks!

> Can anyone direct me towards how to ... make one?

https://hamel.dev/blog/posts/evals/

> What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm?

LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data.
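You can sanity-check that for your own provider in a few lines; a sketch using the AI SDK linked below, where the model ID is just a placeholder:

```ts
// Send the identical prompt twice at temperature 0 and compare outputs.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const prompt = "Name three prime numbers.";
const run = () =>
  generateText({ model: openai("gpt-4o-mini"), prompt, temperature: 0 });

const [a, b] = await Promise.all([run(), run()]);
console.log(a.text === b.text ? "identical outputs" : "outputs differ");
```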

> Finally, any known/commonly used frameworks to do this, or any tooling that can call different LLMs would be enough?

https://github.com/vercel/ai

https://github.com/mattpocock/evalite
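To give a feel for evalite, here is a sketch roughly following the shape of its README example; the option names and the `autoevals` scorer import are from memory, so check the current docs:

```ts
// Minimal evalite eval. Save as something.eval.ts and run with the
// evalite CLI.
import { evalite } from "evalite";
import { Levenshtein } from "autoevals"; // string-similarity scorer

evalite("Capitals", {
  // Each case pairs an input with the expected (golden) answer.
  data: async () => [
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  // The task under test: a stub here, normally your LLM call.
  task: async (input) => {
    return "Paris"; // replace with e.g. a generateText(...) call
  },
  // Scorers compare the task's output against `expected`.
  scorers: [Levenshtein],
});
```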

  • "LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data."

    Is this true? I remember there being a randomization factor in weighing tokens to make the output more something, dont recall what

    Obviously I'm not an Ai dev
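For what it's worth, the randomization being half-remembered here is temperature sampling: the model turns its next-token scores into a probability distribution and samples from it, and the temperature setting controls how flat that distribution is. A toy sketch of the mechanism, not any particular provider's implementation:

```ts
// Toy temperature sampling over next-token logits. Higher temperature
// flattens the distribution (more random output); temperature 0 is
// greedy decoding (always the most likely token).
function sampleToken(logits: number[], temperature: number): number {
  // Temperature 0 (or below) means greedy decoding: take the argmax.
  if (temperature <= 0) {
    return logits.indexOf(Math.max(...logits));
  }
  // Scale logits by 1/temperature, then softmax (subtract max for
  // numerical stability).
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);
  // Draw one index according to the probabilities.
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```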