Comment by languid-photic

18 days ago

Good point.

This post measures `1x top-N` (one attempt each from N models), not `Nx top-1` (N attempts from the best-scoring model). We should make that clearer.

Part of why we chose `1x top-N` is that we expect lower error correlation compared to `Nx top-1`, which is also why the iid baseline is likely optimistic.

That said, a direct comparison (`Nx top-1` vs `1x top-N`, with the same review/compute budget) would be useful!
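To make the comparison concrete, here is a minimal Monte Carlo sketch of how attempt correlation changes the two strategies' success rates. All numbers are illustrative assumptions (not from the post): equal per-attempt success rates across models, a shared-latent Gaussian model of error correlation, and hypothetical correlation values of 0.6 for `Nx top-1` versus 0.3 for `1x top-N`.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def pass_at_n(p_single, n_attempts, rho, trials=200_000):
    """Monte Carlo estimate of P(>=1 success) over n correlated attempts.

    Correlation is induced by a shared per-task latent: each task draws a
    common Gaussian z, each attempt succeeds when sqrt(rho)*z plus its own
    noise clears the threshold matching a marginal success rate p_single.
    rho in [0, 1] is the fraction of score variance shared across attempts.
    """
    thresh = NormalDist().inv_cdf(1 - p_single)   # marginal success threshold
    z = rng.standard_normal((trials, 1))          # shared component per task
    eps = rng.standard_normal((trials, n_attempts))
    score = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps
    return (score > thresh).any(axis=1).mean()

# Hypothetical setup: 24% single-attempt rate, review budget of 5 attempts.
# `Nx top-1` plausibly has higher attempt-to-attempt correlation than `1x top-N`.
n_top1 = pass_at_n(0.24, 5, rho=0.6)   # 5 attempts from one model
top_n  = pass_at_n(0.24, 5, rho=0.3)   # one attempt each from 5 models
iid    = pass_at_n(0.24, 5, rho=0.0)   # iid baseline: 1 - (1 - 0.24)**5
```

Under these assumptions the iid baseline is the most optimistic, and `1x top-N` beats `Nx top-1` exactly because its attempts are less correlated; in practice the models also differ in their marginal rates, which this sketch ignores for simplicity.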

Would you mind sharing the raw results behind the 24% 1x-top-1 figure, along with the raw results and computation used for the error bars?

I would like to continue the likelihood calculation.