Comment by chr15m

18 days ago

Ah, interesting. Thank you very much for sharing these illuminating results.

One question I had - was the judgement blinded? Did judges know which models produced which output?

It was not. The agent id is not overt, but it can be found via the workspace filepath.

But that is a good point. Perhaps it should be mapped to something unidentifiable.

  • Ah, ok. If you do run it again, that would be a worthwhile change. I know I personally have biases about models, and I have seen others say the same, so it seems likely it would skew the results at least a little.

    Nonetheless you've convinced me to try an even wider variety of models, thanks!

    In fact, this makes me think I should add this as a feature to my AI dev tooling - compare responses side by side and pick the best one.
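For anyone wanting to try the blinding idea from this thread, here is a minimal sketch of one way it might look. It assumes each agent's output is available as a string keyed by agent id. The names `blind`, `unblind`, and the `candidate-N` labels are hypothetical, not taken from the original setup. It shuffles the order, swaps each agent id for an opaque label, and also replaces the agent's own id wherever it appears in its output text (for example inside a workspace filepath).

```python
import random


def blind(outputs, seed=None):
    """Hide agent ids behind opaque labels.

    outputs: dict mapping agent id -> output text (assumed shape).
    Returns (blinded, key), where blinded maps label -> scrubbed text
    and key maps label -> original agent id. Keep key away from judges.
    """
    rng = random.Random(seed)
    agent_ids = list(outputs)
    # Shuffle so label numbers do not reveal the original ordering.
    rng.shuffle(agent_ids)

    blinded = {}
    key = {}
    for index, agent_id in enumerate(agent_ids, start=1):
        label = f"candidate-{index}"
        # Replace the agent's own id in its text, e.g. in a filepath.
        blinded[label] = outputs[agent_id].replace(agent_id, label)
        key[label] = agent_id
    return blinded, key


def unblind(label, key):
    """Look up which agent produced the output a judge picked."""
    return key[label]
```

A judge would only ever see the `blinded` dict, side by side, and the winning label would be passed to `unblind` after the verdict is recorded. Note that this only scrubs the literal id string. Stylistic tells in the output itself could still give a model away.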