Comment by DoctorOetker

19 days ago

I have the impression the implied conclusion is that, in the situation described, it would be better to consult several different LLM models than a single specific one, but that is not what they demonstrate:

To demonstrate this, you would measure the compute/cost of running the models and human-verifying the output.

The statistics provided don't at all exclude the possibility that, instead of giving the top 5 models one opportunity each to propose a solution, it may be more efficient to give all 5 opportunities to the best-scoring model:

At a 24% win rate, the null hypothesis (what a researcher ought to predict based on common sense) is that the probability of a loss is 76%, the probability of losing N independent attempts is 0.76^N, and so the probability of winning at least once in N attempts is 1 - 0.76^N.

So for consulting the best-scoring model twice (2x top-1) I would expect a 42.24% win rate, better than giving the 2 top-scoring models a single try each (1x top-2), which resulted in 35%.

Same for 3x top-1 vs 1x top-3: 56.10% vs 51%

Same for 4x top-1 vs 1x top-4: 66.64% vs 66%

Same for 5x top-1 vs 1x top-5: 74.64% vs 73%

Same for 6x top-1 vs 1x top-6: 80.73% vs 83%

Same for 7x top-1 vs 1x top-7: 85.35% vs 90%

Same for 8x top-1 vs 1x top-8: 88.87% vs 95%
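The independence baseline behind these numbers can be sketched in a few lines. Note this assumes attempts are iid with the 24% single-attempt win rate taken from the post; correlated errors between repeated attempts of the same model would pull the real `Nx top-1` numbers below this.

```python
# Independence (iid) baseline: with a per-attempt win rate p, the chance
# that at least one of N independent attempts succeeds is 1 - (1 - p)^N.
p = 0.24  # top-1 single-attempt win rate reported in the post

for n in range(2, 9):
    at_least_one_win = 1 - (1 - p) ** n
    print(f"{n}x top-1: {at_least_one_win:.2%}")
```

Running this reproduces the 42.24%, 56.10%, ..., 88.87% figures above (up to rounding).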

I can't read the numerical error bars on the top-1 model win rate; with them we could calculate a likelihood to see whether the deviation is statistically significant.

Good point.

This post measures `1x top-N` (one attempt each from the top N models), not `Nx top-1` (N attempts from the best-scoring model). We should make that clearer.

Part of why we chose `1x top-N` is that we expect lower error correlation compared to `Nx top-1`, which is also why the iid baseline is likely optimistic.

That said, a direct comparison (`Nx top-1` vs `1x top-N`, with the same review/compute budget) would be useful!

  • Would you mind sharing the raw results that produced the 24% 1x top-1 result, along with the raw data and computation behind the error bars?

    I would like to continue the likelihood calculation.