Comment by noelwelsh

13 years ago

First up, the sales pitch: we provide bandit optimisation SaaS at Myna (http://mynaweb.com). Now that that's out of the way, let's discuss the article.

I like the epsilon-greedy algorithm because it's simple to understand and implement, and easy to extend. However, to claim "The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method" is false. The standard measure of performance is called regret. You can think of it as the number of times you choose a suboptimal option. It is clear that this grows linearly in e-greedy, as there is a constant probability of exploring. The same is true in A/B testing (you show half the people the suboptimal choice in the data-gathering phase and then make a decision that you have some probability of getting wrong). A good bandit algorithm has regret that grows logarithmically with time, which is a huge difference! This result holds up in practice as well. If you look at some of Yahoo's papers (John Langford's, for example; sorry, no links as I'm writing this while getting the kids ready!) you'll see comparisons to e-greedy in which their algorithms significantly outperform it. We've had the same results in our testing.
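To make the linear-versus-logarithmic point concrete, here is a minimal simulation sketch. It is not Myna's code or anything from the Yahoo papers. It assumes a two-armed Bernoulli bandit with made-up payout rates, uses UCB1 as a stand-in for "a good bandit algorithm", and measures regret the simple way described above, as the count of suboptimal pulls. The names `run_bandit`, `epsilon_greedy` and `ucb1` are just illustrative.

```python
import math
import random


def run_bandit(choose, probs, steps, rng):
    """Play a Bernoulli bandit and return the number of suboptimal pulls."""
    counts = [0] * len(probs)   # pulls per arm
    sums = [0.0] * len(probs)   # total reward per arm
    best = max(range(len(probs)), key=lambda a: probs[a])
    suboptimal = 0
    for t in range(1, steps + 1):
        arm = choose(counts, sums, t, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        if arm != best:
            suboptimal += 1
    return suboptimal


def epsilon_greedy(epsilon):
    """Build a chooser that explores uniformly with probability epsilon."""
    def choose(counts, sums, t, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(counts))
        # Untried arms get an infinite estimate so each is tried at least once.
        means = [s / c if c else float("inf") for s, c in zip(sums, counts)]
        return max(range(len(counts)), key=lambda a: means[a])
    return choose


def ucb1(counts, sums, t, rng):
    """UCB1: pick the arm with the highest mean plus confidence bonus."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # try every arm once before using the bonus formula
    scores = [s / c + math.sqrt(2 * math.log(t) / c)
              for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=lambda a: scores[a])


if __name__ == "__main__":
    probs = [0.2, 0.8]  # assumed payout rates, chosen only for illustration
    for steps in (5000, 10000, 20000):
        eg = run_bandit(epsilon_greedy(0.1), probs, steps, random.Random(0))
        ucb = run_bandit(ucb1, probs, steps, random.Random(0))
        print(steps, "e-greedy:", eg, "UCB1:", ucb)
```

With a fixed epsilon of 0.1 and two arms, e-greedy keeps pulling the worse arm on roughly 5% of steps forever, so its count should roughly double each time the horizon doubles, while the UCB1 count should barely move. The gap between arms here is deliberately large; with a small gap UCB1 needs far more pulls before the difference shows.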

Yeah, I think the problem here is that trying to be a little bit smart kind of gets you into the space where really you should be doing things a LOT smart. A/B testing provides data that doesn't require much in the way of brains to interpret and is hard to draw poor conclusions from (beyond treating something as statistically significant when it is not). Once you step off into epsilon-greedy, you fall into the whole reinforcement learning space.

To that end, btw, I think a service like yours is potentially quite valuable!

  • Actually, you kind of are already in the RL space when using A/B testing to make online decisions; you just may not be thinking of it that way. From Sutton & Barto: "Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal." That is exactly what you are doing when applying A/B-style hypothesis testing to inform decisions in an online application. Plus, personally, I think A/B testing is, in a way, much harder to interpret; at least, most folks interpret it wrong, which isn't a knock, since it provides a non-intuitive - at least to me ;) - result.

    • Maximizing a numerical reward signal is definitely not what we're doing when we do an A/B test.

      We collect a variety of metrics. When we do an A/B test, we look at all of them as a way of understanding what effect our change has on user behavior and long-term outcomes.

      A particular change may be intended to affect just one metric, but that's in an all-else-equal way. It's not often the case that our changes affect only one metric. And that's great, because that gives us hints as to what our next test should be.

As far as I understand it, an advantage of the epsilon-greedy algorithm is that it will relearn the best choice if that choice changes over time. Now, you could do that with a logarithmic-regret algorithm as well, but it would take more time to relearn.