Comment by koakuma-chan

3 months ago

> Can anyone direct me towards how to ... make one?

https://hamel.dev/blog/posts/evals/

> What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm?

LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data.
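For anyone wondering what the temperature knob does mechanically, here is a rough sketch of temperature sampling over a toy vocabulary. The logits are made-up numbers and `sample_token` is a hypothetical helper, not any provider's API; it only illustrates the textbook technique, and says nothing about how a particular hosted model behaves.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits using temperature sampling."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Draw one index with probability proportional to its weight.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical logits for a 4-token vocabulary (made-up numbers).
logits = [2.0, 1.5, 0.3, -1.0]
rng = random.Random(0)

# Collect the distinct tokens seen over many draws at each temperature.
greedy = {sample_token(logits, 0, rng) for _ in range(1000)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(1000)}
print("temperature 0:", greedy)
print("temperature 1:", sampled)
```

In this toy setup, temperature 0 always returns the same token for the same logits, while temperature 1 spreads its picks across several tokens. Whether one attempt is enough in a real eval depends on the settings you actually run with.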

> Finally, any known/commonly used frameworks to do this, or any tooling that can call different LLMs would be enough?

https://github.com/vercel/ai

https://github.com/mattpocock/evalite
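If you would rather start without a framework, the core loop is small. Below is a minimal sketch, assuming each model is wrapped in a plain function that takes a prompt and returns text. `run_eval`, `steady_model`, and `flaky_model` are hypothetical names for illustration; the two models are stand-ins that never call a real API, and exact string matching is just the simplest possible scorer.

```python
import random
from collections import defaultdict

def run_eval(models, cases, attempts=10):
    """Return each model's pass rate over `attempts` runs of every case."""
    passes = defaultdict(int)
    for name, call_model in models.items():
        for prompt, expected in cases:
            for _ in range(attempts):
                # Exact-match scoring; real evals often need a looser check.
                if call_model(prompt).strip() == expected:
                    passes[name] += 1
    total = len(cases) * attempts
    return {name: passes[name] / total for name in models}

# Hypothetical stand-ins for real LLM clients (assumption: no network calls).
def steady_model(prompt):
    return "4"

_rng = random.Random(0)

def flaky_model(prompt):
    # Answers correctly about 70% of the time, to mimic sampling variance.
    return "4" if _rng.random() < 0.7 else "5"

cases = [("What is 2 + 2? Answer with a single digit.", "4")]
rates = run_eval({"steady": steady_model, "flaky": flaky_model}, cases, attempts=20)
print(rates)
```

Swapping the stand-ins for real client calls keeps the loop unchanged, and the `attempts` parameter makes it cheap to check whether repeated runs on the same input actually differ for your models.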

"LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data."

Is this true? I remember there being a randomization factor in how tokens are weighted, meant to make the output more... something. I don't recall what.

Obviously I'm not an AI dev.