Comment by brookst
1 day ago
> IMO you can never use an AI agent benchmark that is published on the internet more than once.
This is a long-solved problem far predating AI.
You do it by releasing 90% of the benchmark publicly and holding back 10% for yourself or closely trusted partners.
Then anyone with access to the holdback can independently check whether a model's score on the 10% matches its score on the public 90%. A large gap suggests the public portion has leaked into training.
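
A rough sketch of what that looks like in practice, in Python. The function names, the fixed seed, and the per-item pass/fail scoring are my own assumptions for illustration, not part of any particular benchmark's tooling:

```python
import random


def split_benchmark(items, holdout_fraction=0.1, seed=0):
    """Shuffle items reproducibly and split them into (public, holdout).

    Hypothetical helper: the 10% default mirrors the 90/10 split described above.
    """
    rng = random.Random(seed)  # private RNG so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(results):
    """Fraction of passed items, given a list of booleans."""
    return sum(results) / len(results)


def public_holdout_gap(public_results, holdout_results):
    """Public accuracy minus holdout accuracy.

    A clearly positive gap hints that the public items may have been seen
    in training; small gaps can be noise, especially with a small holdout.
    """
    return accuracy(public_results) - accuracy(holdout_results)


if __name__ == "__main__":
    public, holdout = split_benchmark(range(100))
    print(len(public), len(holdout))
    # Made-up pass/fail results, only to show how the gap is computed.
    print(public_holdout_gap([True] * 9 + [False], [True] * 5 + [False] * 5))
```

The holdout is small, so the gap is noisy. In a real comparison you'd want a confidence interval or a significance test before calling anything contaminated.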