Comment by imiric

6 days ago

That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt, so you can't really be certain that a difference in output comes down to the single variable you isolated.

So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure the results. I only reach for that when I want even more confidence, typically when I want to compare models more precisely. I've found the results fairly similar across runs for the same model/prompt/settings, even though we can't set a seed for most models/agents.
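For reference, the eval loop can look roughly like this. It's a minimal sketch, not the actual Copilot SDK API: `runPrompt` is a hypothetical stand-in for whatever call sends the prompt to the agent, and counting distinct outputs is just one crude stability metric.

    // Hypothetical stand-in for the real SDK call that sends a
    // prompt to a given model and returns its text output.
    async function runPrompt(model: string, prompt: string): Promise<string> {
      // ... invoke the agent/SDK here ...
      return "";
    }

    // Re-run the same prompt N times and count distinct outputs,
    // a rough proxy for how stable the model is on this prompt.
    async function measureStability(model: string, prompt: string, runs = 10) {
      const outputs: string[] = [];
      for (let i = 0; i < runs; i++) {
        outputs.push(await runPrompt(model, prompt));
      }
      const distinct = new Set(outputs).size;
      console.log(`${model}: ${distinct}/${runs} distinct outputs`);
      return outputs;
    }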

Same with people: no matter what info you give a person, you can't be sure they'll follow it the same way every time.