Comment by loehnsberg

15 hours ago

Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but post says that an ELO is kept, so this is also an experiment, and I don‘t see what‘s wrong with that. You learn which model performs well in which settings which may save resources later. It‘s also wasteful to keep working with the wrong model/harness/tools for too long.