Comment by DoctorOetker
19 days ago
I have the impression the implied conclusion is that, in the situation described, it would be better to consult several different LLM models than a single one, but that is not what they demonstrate:
to demonstrate this you would need to measure the compute/cost of running the models and human-verifying their output.
the statistics provided don't at all exclude the possibility that, instead of giving the top 5 models one opportunity each to propose a solution, it may be more efficient to give all 5 opportunities to the best-scoring model:
at a 24% win rate the null hypothesis (what a usual researcher ought to predict based on common sense) would be that the probability of a loss is 76%, the probability of N consecutive losses is (0.76 ^ N), and so the probability of at least one win in N attempts is ( 1 - (0.76 ^ N) ).
So for consulting the best-scoring model twice (2x top-1) I would expect a 42.24% win rate, better than giving the top 2 scoring models a single try each (1x top-2), which resulted in 35%.
Same for 3x top-1 vs 1x top-3: 56.10% vs 51%
Same for 4x top-1 vs 1x top-4: 66.64% vs 66%
Same for 5x top-1 vs 1x top-5: 74.64% vs 73%
Same for 6x top-1 vs 1x top-6: 80.73% vs 83%
Same for 7x top-1 vs 1x top-7: 85.35% vs 90%
Same for 8x top-1 vs 1x top-8: 88.87% vs 95%
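The comparison above can be reproduced with a short script (assuming independent attempts at a fixed 24% per-attempt win rate; the `1x top-N` percentages are the ones reported in the post):

```python
# Null model: attempts are iid with per-attempt win rate p = 0.24, so
# P(at least one win in N attempts) = 1 - (1 - p)**N.
p = 0.24

# Reported 1x top-N win rates (%) from the post, for N = 2..8.
reported_top_n = {2: 35, 3: 51, 4: 66, 5: 73, 6: 83, 7: 90, 8: 95}

for n, top_n in reported_top_n.items():
    nx_top_1 = (1 - (1 - p) ** n) * 100
    print(f"{n}x top-1: {nx_top_1:5.2f}%  vs  1x top-{n}: {top_n}%")
```

Under this null, repeated sampling of the single best model overtakes the model ensemble for N ≤ 5 and falls behind from N = 6 onward.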
I can't read the numerical error bars on the top-1 model win rate; with those, we could calculate a likelihood to see if the deviation is statistically significant.
Good point.
This post measures `1x top-N` (one attempt each from N models), not `Nx top-1` (N attempts from the best-scoring model). We should make that more clear.
Part of why we chose `1x top-N` is that we expect lower error correlation compared to `Nx top-1`, which is also why the iid baseline is likely optimistic.
That said, a direct comparison (`Nx top-1` vs `1x top-N`, with the same review/compute budget) would be useful!
Would you mind sharing the raw results that produced the 1x top-1 result (24%), along with the raw results and computation behind the error bars?
I would like to continue the likelihood calculation.
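For reference, here is the kind of likelihood-ratio check I have in mind, as a minimal sketch. The counts (24 wins out of 100 tasks) are hypothetical placeholders, since the raw results aren't available yet:

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood of k wins in n independent trials with win rate p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

# Hypothetical counts behind the 24% top-1 win rate (actual n unknown).
k, n = 24, 100
p_hat = k / n  # maximum-likelihood estimate, 0.24

# Likelihood-ratio statistic against some alternative rate p0,
# e.g. a rate implied by a competing hypothesis.
p0 = 0.35
lr = 2 * (binom_loglik(k, n, p_hat) - binom_loglik(k, n, p0))
print(f"LR statistic: {lr:.2f} (compare to chi-squared, 1 dof: ~3.84 at the 5% level)")
```

With the real per-task win/loss counts and sample sizes, the same statistic would tell us whether the observed `1x top-N` rates deviate significantly from the `Nx top-1` null prediction.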