Comment by chr15m
18 days ago
Ah interesting. Thank you very much for sharing the illuminating results.
One question I had - was the judgement blinded? Did judges know which models produced which output?
18 days ago
It was not. The agent id is not shown overtly, but it can be found via the workspace filepath.
But that is a good point. Perhaps it should be mapped to something unidentifiable.
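A rough sketch of what that mapping could look like, assuming the agent ids appear as plain strings in the paths and outputs shown to judges. The function names and the example ids here are made up for illustration and are not from the actual setup:

```python
import random


def blind_labels(agent_ids, seed=None):
    """Assign each agent id a neutral label ("Model 1", "Model 2", ...).

    The order is shuffled so a label's number reveals nothing about the agent.
    """
    rng = random.Random(seed)
    unique_ids = sorted(set(agent_ids))
    rng.shuffle(unique_ids)
    return {agent: f"Model {i + 1}" for i, agent in enumerate(unique_ids)}


def redact(text, labels):
    """Replace every agent id in text (e.g. a workspace filepath) with its label."""
    # Replace longer ids first so an id that is a prefix of another
    # does not leave an identifiable remainder behind.
    for agent in sorted(labels, key=len, reverse=True):
        text = text.replace(agent, labels[agent])
    return text


if __name__ == "__main__":
    # Hypothetical agent ids, for illustration only.
    labels = blind_labels(["agent-alpha", "agent-beta"], seed=42)
    print(redact("/workspaces/agent-alpha/output.md", labels))
```

The label mapping would be kept away from the judges and only used to un-blind the scores afterwards.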
Ah ok. If you do run it again that would be a worthwhile change. I know I personally have biases about models and I have seen others commenting the same - it seems likely it would skew the results at least a little.
Nonetheless you've convinced me to try an even wider variety of models, thanks!
In fact, this makes me think I should add this as a feature to my AI dev tooling - compare responses side by side and pick the best one.