Comment by sdenton4

4 months ago

Taking a slightly closer look at the paper, you've got K repositories, with a set of test cases created within each one, totaling 130-ish tests. There may be some 'repository-level' effects, i.e., tasks may be easier in some repos than in others.

Modeling the overall success rate then requires some hierarchical modeling. You can consider each repository as a weighted coin, and each test within a repository as a flip of that particular coin. You want to estimate the overall probability of getting heads when choosing a coin at random and then flipping it.
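To make the coin picture concrete, here's a rough sketch of one way to do it, not necessarily what the paper or the linked hints do. It assumes a beta-binomial model: each repository's success rate is drawn from a Beta distribution with mean mu and concentration kappa, and the "pick a coin, then flip it" probability is just mu. The priors (uniform on mu, log-uniform on kappa between 0.1 and 1000), the grid approximation, and the per-repository counts are all my own placeholders, not numbers from the paper.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(successes, trials, mu, kappa):
    """Beta-binomial log-likelihood of the per-repo counts.

    Each repo's rate is integrated out analytically. Binomial
    coefficients are dropped because they do not depend on mu or kappa.
    """
    a, b = mu * kappa, (1.0 - mu) * kappa
    base = log_beta(a, b)
    return sum(log_beta(y + a, n - y + b) - base
               for y, n in zip(successes, trials))


def overall_rate_interval(successes, trials, grid=100, level=0.95):
    """Posterior mean and equal-tailed credible interval for mu.

    Assumed priors: uniform on mu, log-uniform on kappa in [0.1, 1000].
    """
    mus = [(i + 0.5) / grid for i in range(grid)]
    kappas = [10 ** (-1 + 4 * j / (grid - 1)) for j in range(grid)]
    log_post = [[log_marginal(successes, trials, m, k) for k in kappas]
                for m in mus]
    peak = max(max(row) for row in log_post)

    # Sum over kappa to get the (unnormalized) marginal posterior of mu.
    weights = [sum(math.exp(v - peak) for v in row) for row in log_post]
    total = sum(weights)
    probs = [w / total for w in weights]

    mean = sum(m * p for m, p in zip(mus, probs))

    # Read the interval endpoints off the cumulative distribution.
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    cum, lower, upper = 0.0, mus[0], mus[-1]
    found_lower = False
    for m, p in zip(mus, probs):
        cum += p
        if not found_lower and cum >= lo_q:
            lower, found_lower = m, True
        if cum >= hi_q:
            upper = m
            break
    return mean, lower, upper


# Made-up counts for six hypothetical repositories (NOT from the paper).
successes = [9, 4, 12, 7, 10, 3]
trials = [12, 10, 15, 11, 13, 9]

mean, lower, upper = overall_rate_interval(successes, trials)
print(f"overall success rate ~ {mean:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

If the repositories really do differ in difficulty, this interval should come out wider than the one you'd get by pooling all the tests into a single binomial, since the effective sample size is closer to the number of repositories than to the number of tests. For anything serious, a proper sampler (PyMC, Stan, etc.) would be a better fit than a grid.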

Here are some Gemini hints on how to get the confidence interval using hierarchical Bayes: https://gemini.google.com/corp/app/e9de6a12becc57f6