Comment by jleader
13 years ago
It seems to me that A/B (or A/B/C/...) testing as described by btilly, and epsilon-greedy multi-armed bandit optimisation as described by Steve Hanov are two points on a continuum. In A/B/... testing, you're exploring 100% of your traffic, and eventually you declare the test "done", and start exploiting 100% of your traffic (sending it to the "best" slice). You have the advantage that during the exploration phase, your assignment of users to test slices is completely random, uncorrelated with anything else. In the epsilon-greedy approach, you're exploring with 10% of your traffic, and exploiting the current best-looking choice with 90% of your traffic. The trouble is that now 90% of your data is coming from one slice, and which slice it's coming from could vary over time. This means, as Ben points out, that your choice of slice could be 90% correlated with anything else that varies over time.
One approach would be to ignore the data from the 90% exploitation; that way, you only get 10% of the data, but its slice assignment is completely random and uncorrelated with anything else that might be happening. The trouble is that now you're running an A/B/... test on only 10% of your traffic, which means that it will converge 10x slower than if you were running it on 100% of your traffic.
However, it seems to me that the extra 90% of data that I've proposed ignoring isn't that useful, because it's only coming from one slice at a time. What you really want is to get more data from the slices you know least about. I suspect there are reinforcement learning algorithms that take into account not just the reward rate for each slice, but the current level of certainty with which the algorithm knows the reward rate, so it can collect more data about the slices it knows the least about, and stop collecting data about the slices for which it already has a fairly accurate reward estimate. The question is, are there such algorithms that can also handle non-stationary reward distributions? And how much tuning and tweaking do they require?
That data from the 90% on slice X is valuable because you're trying to confirm, or falsify, that X is better, quickly. The strategy is "look closely at this pointy thing to see if it's a needle or a sharp piece of straw."
But... "better" than the other choices, which are only getting 1/10th the traffic. The amount of traffic sent to a choice should be a function of how good you currently think it is, and how much more data you need to be sufficiently certain about that choice. So choices that have insufficient data should get more traffic, and choices that already have sufficient data to be sure they're worse than some other choice should get very little traffic. How much traffic they should get depends on how certain you are of stationarity (is that a word?).