Comment by brookst

1 day ago

> IMO you can never use an AI agent benchmark that is published on the internet more than once.

This is a long-solved problem far predating AI.

You do it by releasing 90% of the benchmark publicly and holding back 10% for yourself or closely trusted partners.

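Then anyone with access to the holdback can independently check whether a model's score on the private 10% matches its score on the public 90%. A model that does much better on the public portion has probably seen it during training.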
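A rough sketch of what that check could look like, in Python. None of this comes from a real benchmark harness: `is_holdback`, `compare_splits`, the hash-based split, and the two-proportion z-test are all my own assumptions for illustration, and the 10% fraction is just the number from above.

```python
import hashlib
import math


def is_holdback(item_id: str, holdback_fraction: float = 0.10) -> bool:
    """Deterministically assign an item to the private holdback by hashing its ID.

    Hypothetical helper: hashing keeps the split stable across runs without
    having to store a list of which items are private.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdback_fraction


def compare_splits(public_results: list[bool], holdback_results: list[bool]):
    """Return (public_accuracy, holdback_accuracy, z_score).

    The z-score is a standard two-proportion z-test with a pooled standard
    error. A large positive value means the model does noticeably better on
    the public items than on the private ones.
    """
    if not public_results or not holdback_results:
        raise ValueError("need at least one result in each split")
    n_pub, n_hid = len(public_results), len(holdback_results)
    p_pub = sum(public_results) / n_pub
    p_hid = sum(holdback_results) / n_hid
    pooled = (sum(public_results) + sum(holdback_results)) / (n_pub + n_hid)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_pub + 1 / n_hid))
    z = (p_pub - p_hid) / se if se > 0 else 0.0
    return p_pub, p_hid, z


if __name__ == "__main__":
    # Made-up numbers: 90% correct on public items, 50% on the holdback.
    public = [True] * 90 + [False] * 10
    hidden = [True] * 5 + [False] * 5
    p_pub, p_hid, z = compare_splits(public, hidden)
    print(f"public={p_pub:.2f} holdback={p_hid:.2f} z={z:.2f}")
```

With a small holdback the test is noisy, so a modest gap doesn't prove anything on its own. It's the large, repeatable gaps that are worth worrying about.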