Comment by vibrunazo
13 years ago
Why can't you do the same study from the bandit results? From what I understand, this is the same as A/B testing, except it will only show a suboptimal result to 10% of the users instead of 50% (or more). After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?
The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand, the only difference is that the bandit will be bugging your users with bad results less often.
Sure, but wouldn't that take longer to get us our answer, and therefore keep us from moving on to the next experiment sooner?
Of course, the real answer to why not is "we have an A/B system and I'm not going to add bandit stuff for no benefit". But even if I were doing it from scratch, it seems more complex. The benefit of these approaches seems to be that one no longer has to think. We want to think.
"After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?"
No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.
If you are using an epsilon-greedy approach (or something similar), then I believe that the data collected during the exploration portion (the random calls) is open to standard hypothesis testing, albeit with less power due to the reduced sample size. Think of it this way: you might normally run your experiment on a subset of your traffic (population), say 20%, with the rest (80%) getting the current experience. With the e-greedy type of approach you are just swapping the 'current experience' with the maximum estimated experience, but that other 20% is still a random draw.
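To make the comment above concrete, here is a minimal sketch of an epsilon-greedy assigner that flags which draws came from the exploration branch, so that only those could be pulled out for a separate analysis. The function names, the `stats` layout, and the simulated conversion rates are all assumptions for illustration; nothing here comes from a particular bandit library, and it does not settle whether that subset is safe to test on.

```python
import random

def epsilon_greedy_assign(stats, epsilon, rng):
    """Return (arm, explored). stats maps arm -> [conversions, trials]."""
    arms = list(stats)
    if rng.random() < epsilon:
        # Exploration: uniform random draw that ignores past results.
        return rng.choice(arms), True

    def rate(arm):
        conversions, trials = stats[arm]
        return conversions / trials if trials else 0.0

    # Exploitation: pick the arm with the best estimated rate so far.
    return max(arms, key=rate), False

def simulate(true_rates, epsilon=0.2, n_users=10_000, seed=0):
    """Simulate users; true_rates is a hypothetical arm -> conversion rate map."""
    rng = random.Random(seed)
    stats = {arm: [0, 0] for arm in true_rates}        # all traffic
    explore_log = {arm: [0, 0] for arm in true_rates}  # exploration draws only
    for _ in range(n_users):
        arm, explored = epsilon_greedy_assign(stats, epsilon, rng)
        converted = rng.random() < true_rates[arm]
        stats[arm][0] += converted
        stats[arm][1] += 1
        if explored:
            explore_log[arm][0] += converted
            explore_log[arm][1] += 1
    return stats, explore_log

if __name__ == "__main__":
    stats, explore_log = simulate({"A": 0.10, "B": 0.12})
    print("all traffic:     ", stats)
    print("exploration only:", explore_log)
```

With `epsilon=0.2`, roughly 20% of users land in `explore_log`, split about evenly between the arms, which is the "other 20%" the comment refers to.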
The draws aren't independent. At any given time, the probability of assigning a user to a cohort depends on a function of the previous observations (in other words, it's a Markov model).
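A small sketch of that dependence, assuming an epsilon-greedy rule with uniform exploration (the function name and the `stats` layout are made up for illustration): the chance that the next user lands in a given cohort is computed from the conversions observed so far, so it shifts whenever the leading arm changes.

```python
def assignment_probabilities(stats, epsilon):
    """Probability that the next user is assigned to each arm.

    stats maps arm -> (conversions, trials); epsilon is the exploration share.
    """
    arms = list(stats)

    def rate(arm):
        conversions, trials = stats[arm]
        return conversions / trials if trials else 0.0

    best = max(arms, key=rate)     # current leader, decided by past data
    explore = epsilon / len(arms)  # every arm gets an equal exploration share
    return {
        arm: explore + ((1 - epsilon) if arm == best else 0.0)
        for arm in arms
    }

if __name__ == "__main__":
    # Same rule, two different histories, two different assignment odds.
    print(assignment_probabilities({"A": (5, 100), "B": (9, 100)}, 0.2))
    print(assignment_probabilities({"A": (12, 100), "B": (9, 100)}, 0.2))
```

In a fixed 50/50 A/B test the equivalent function would return the same numbers no matter what `stats` contained.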
The standard confidence tests -- t-tests, G-tests, chi-squared tests, etc. -- are based on distributions of independent, identically distributed (iid) data.
I'd have to think about it more, but I believe that btilly's examples are also the most intuitive reasons why independence matters. If your data is time-dependent, then assigning users to cohorts based on past performance lets the time dependency dominate. There may be other good examples.