You can really only do a fair reproducible test if the models are static and not sitting behind an api where you have no idea on how they are updated or continuously tweaked.
This particular test is heralded as some sort of breakthrough and the companies in this field are raising billions of dollars from investors and paying their star employees tens of millions.
The economic incentives to tweak, tune, or cheat are through the roof.
You can really only do a fair reproducible test if the models are static and not sitting behind an api where you have no idea on how they are updated or continuously tweaked.
This particular test is heralded as some sort of breakthrough and the companies in this field are raising billions of dollars from investors and paying their star employees tens of millions.
The economic incentives to tweak, tune, or cheat are through the roof.