Comment by operatingthetan 19 hours ago
Would creating new benchmarks every month solve this problem?
preciousoo 19 hours ago
Or create "blind" benchmarks.
Say 10 groups of 3 researchers, each with its own benchmark that it does not share. (Testing a model without its authors knowing is a separate problem; maybe the groups only run their benchmarks once the general public has access to the model.)
That's 10 different tests. Aggregate the pass rates.
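A rough sketch of what "aggregate the pass rates" could look like, assuming each group reports only a list of pass/fail results and that every group counts equally (an unweighted mean). The group names, the data, and the function names are made up for illustration; the comment above does not specify any of them.

```python
from statistics import mean


def pass_rate(results):
    """Fraction of items passed on one group's private benchmark."""
    if not results:
        raise ValueError("a benchmark needs at least one result")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return sum(results) / len(results)


def aggregate_pass_rates(group_results):
    """Return per-group pass rates and their unweighted mean.

    Assumption: each group gets one equal vote, regardless of how many
    items its benchmark contains.
    """
    rates = {group: pass_rate(r) for group, r in group_results.items()}
    return rates, mean(list(rates.values()))


# Hypothetical results from three of the blind groups (illustrative only).
group_results = {
    "group_a": [True, True, False, True],
    "group_b": [True, False],
    "group_c": [True, True, True, True, True],
}

per_group, overall = aggregate_pass_rates(group_results)
print(per_group)
print(overall)
```

Averaging per group rather than pooling all items keeps one large benchmark from dominating the score, and each group only has to publish a single number, so its test items stay private.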