Comment by timr
13 years ago
"After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?"
No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.
If you are using an epsilon-greedy approach (or something similar), then I believe that the data collected during the exploration portion (the random calls) is open to standard hypothesis testing, albeit with less power due to the reduced sample size. Think of it this way: you might normally run your experiment on a subset of your traffic (population), say 20%, with the remaining 80% getting the current experience. With the e-greedy type of approach you are just swapping the 'current experience' for the option with the highest estimated performance, but that other 20% is still a random draw.
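To make the explore/exploit split concrete, here is a minimal sketch of an epsilon-greedy assigner that logs the exploration draws separately. The function names, the two-arm setup, and the conversion rates are all hypothetical, chosen only for illustration; this is not any particular library's implementation.

```python
import random


def epsilon_greedy_assign(stats, epsilon, rng):
    """Pick an arm and report whether the pick was an exploration draw."""
    arms = list(stats)
    if rng.random() < epsilon:
        # Explore: uniform random choice that ignores all past observations.
        return rng.choice(arms), True
    # Exploit: the arm with the highest observed conversion rate so far.
    best = max(arms, key=lambda a: stats[a]["successes"] / max(stats[a]["trials"], 1))
    return best, False


def simulate(true_rates, epsilon=0.2, n_users=10000, seed=0):
    """Run a simulated test; true_rates is a made-up {arm: conversion rate} map."""
    rng = random.Random(seed)
    stats = {arm: {"trials": 0, "successes": 0} for arm in true_rates}
    explore_log = {arm: {"trials": 0, "successes": 0} for arm in true_rates}
    for _ in range(n_users):
        arm, was_explore = epsilon_greedy_assign(stats, epsilon, rng)
        converted = rng.random() < true_rates[arm]
        # Every observation feeds the bandit's running estimates.
        stats[arm]["trials"] += 1
        stats[arm]["successes"] += converted
        # Only the exploration draws go into the separate log.
        if was_explore:
            explore_log[arm]["trials"] += 1
            explore_log[arm]["successes"] += converted
    return stats, explore_log


if __name__ == "__main__":
    all_stats, explore_only = simulate({"A": 0.10, "B": 0.12})
    print("all draws:    ", all_stats)
    print("explore draws:", explore_only)
```

In this sketch the exploration log receives roughly epsilon of the traffic, split about evenly between the arms, while the full log is skewed toward whichever arm currently looks best.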
The draws aren't independent. At any given time, the probability of assigning a user to a cohort depends on a function of the previous observations (in other words, it's a Markov model).
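A small sketch of that dependence, assuming an epsilon-greedy policy over two arms (the function name and the counts are hypothetical): the overall probability of landing in each cohort is recomputed from the observations so far, so it moves whenever the leading arm changes.

```python
def assignment_probabilities(stats, epsilon):
    """Overall chance of each arm being assigned, given the history in stats."""
    arms = list(stats)

    def observed_rate(arm):
        return stats[arm]["successes"] / max(stats[arm]["trials"], 1)

    best = max(arms, key=observed_rate)
    # Every arm gets an equal share of the exploration traffic...
    probs = {arm: epsilon / len(arms) for arm in arms}
    # ...and the current leader also gets all of the exploitation traffic.
    probs[best] += 1 - epsilon
    return probs


if __name__ == "__main__":
    history = {
        "A": {"trials": 10, "successes": 3},
        "B": {"trials": 10, "successes": 2},
    }
    print(assignment_probabilities(history, 0.2))  # A is favoured

    # A few more conversions on B flip the leader, and the probabilities with it.
    history["B"] = {"trials": 15, "successes": 6}
    print(assignment_probabilities(history, 0.2))  # B is favoured
```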
The standard confidence tests -- t-tests, G-tests, chi-squared tests, etc. -- are based on distributions of independent, identically distributed (iid) data.
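For reference, here is a rough sketch of what one of those tests computes: Pearson's chi-squared test on a 2x2 table of conversions, written out by hand with no continuity correction. The function name and the counts are made up, and the sketch assumes every expected cell count is nonzero. The p-value step is where the iid assumption enters, since it compares the statistic against a chi-squared distribution with one degree of freedom.

```python
import math


def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared statistic and p-value for two cohorts' conversions."""
    observed = [
        [conv_a, n_a - conv_a],  # cohort A: converted, did not convert
        [conv_b, n_b - conv_b],  # cohort B: converted, did not convert
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = n_a + n_b

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion were unrelated to cohort.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = chi_squared_2x2(30, 100, 20, 100)
    print(f"chi2={chi2:.3f}, p={p:.3f}")
```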
I'd have to think about it more, but I believe that btilly's examples are also the most intuitive reasons why independence matters. If your data is time-dependent, then assigning users to cohorts based on past performance lets the time dependency dominate. There may be other good examples.
Is that true in the e-greedy case? Sure, during the exploit calls the draws are not independent, but during the explore portion I would assume they are, since users are randomly assigned to the exploration pool (with probability epsilon) and then assigned to an option by a uniform random draw. As far as I can see, no information from prior draws is being used.