Comment by chr15m

18 days ago

Ah interesting. Thank you very much for sharing the illuminating results.

One question I had - was the judgement blinded? Did judges know which models produced which output?

2 comments

chr15m

It was not. The agent ID is not overt, but it can be found via the workspace file path.

But that is a good point. Perhaps it should be mapped to something unidentifiable.
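
For what it's worth, here is a minimal sketch of that mapping in Python, assuming the agent IDs are plain strings that appear verbatim in the workspace path. The names `blind_labels` and `blind_path` and the `candidate-NN` label format are hypothetical, not taken from the actual evaluation setup:

```python
import random


def blind_labels(agent_ids, seed=None):
    """Assign each distinct agent id a shuffled, opaque label (hypothetical helper)."""
    rng = random.Random(seed)
    unique_ids = sorted(set(agent_ids))
    rng.shuffle(unique_ids)  # shuffle so label order does not leak the input order
    return {agent: f"candidate-{i:02d}" for i, agent in enumerate(unique_ids, start=1)}


def blind_path(path, labels):
    """Replace any agent id embedded in a path with its opaque label."""
    # Longest ids first, so an id that is a prefix of another is not replaced early.
    for agent in sorted(labels, key=len, reverse=True):
        path = path.replace(agent, labels[agent])
    return path


# Assumed example ids and path; the real ones would come from the workspace.
labels = blind_labels(["agent-x", "agent-y", "agent-z"], seed=0)
print(blind_path("/workspaces/agent-y/output.md", labels))
```

The judge would only ever see the blinded labels and paths, and the `labels` mapping would be kept aside until scoring is finished.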

chr15m 18 days ago

Ah ok. If you do run it again, that would be a worthwhile change. I know I personally have biases about models, and I have seen others say the same, so it seems likely that knowing the source would skew the results at least a little.
Nonetheless you've convinced me to try an even wider variety of models, thanks!
In fact, this makes me think I should add this as a feature to my AI dev tooling - compare responses side by side and pick the best one.