Comment by conductrics

13 years ago

I think that rather than getting hung up on e-greedy vs. A/B testing vs. UCB (Bayesian vs. non-Bayesian), it is helpful to first step back and think about the larger problem of online learning as a form of the prediction/control problem. The joint problem is to 1) LEARN (estimate) the values of possible courses of action in order to predict outcomes, and 2) CONTROL the application by selecting the best action for a particular situation.
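To make the LEARN/CONTROL split concrete, here is a minimal sketch of an e-greedy agent in Python. The class and method names (EpsilonGreedyAgent, learn, control) are made up for illustration and do not come from any library; the running-average update is one common choice of estimator, not the only one.

```python
import random


class EpsilonGreedyAgent:
    """Hypothetical agent that separates estimation (learn) from action selection (control)."""

    def __init__(self, n_actions, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_actions    # how often each action has been played
        self.values = [0.0] * n_actions  # running mean reward per action
        self.rng = random.Random(seed)

    def learn(self, action, reward):
        """LEARN: update the estimated value of an action from an observed reward."""
        self.counts[action] += 1
        n = self.counts[action]
        # Incremental mean: new = old + (reward - old) / n
        self.values[action] += (reward - self.values[action]) / n

    def control(self):
        """CONTROL: explore with probability epsilon, otherwise play the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)
```

Swapping in UCB or a Bayesian rule would, in this framing, mostly change control() and what learn() keeps track of; the loop around them (pick an action, observe a reward, update) stays the same.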

I noted elsewhere that A/B testing can be thought of as an epsilon-first learning approach: play random 100% of the time until p-value < alpha, then play greedy (play the 'winner'). As an aside, it is unclear to me how using p-values is a clearer, easier, or more efficient decision rule for these types of problems. The p-value is almost always misinterpreted as Prob(B>A|Data); the choice of alpha determines the threshold but is arbitrary, and is often a straw-man default, implicitly biasing your confusion matrix. I'm not saying that you won't get good results, just that it is not clear that it is a dominant approach.
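A rough Python sketch of that epsilon-first reading of A/B testing, for two arms with binary rewards. Everything here is an assumption for illustration: the function names, the pooled two-proportion z-test as the source of the p-value, the minimum of 30 plays per arm before testing, and the uniform Beta(1, 1) prior used for the Prob(B>A|Data) estimate that is included only for contrast with the p-value.

```python
import math
import random


def two_proportion_p_value(succ_a, n_a, succ_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (succ_b / n_b - succ_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def prob_b_beats_a(succ_a, n_a, succ_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of Prob(B>A|Data), assuming independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += b > a
    return wins / draws


def epsilon_first(true_rates, alpha=0.05, min_per_arm=30, max_plays=100000, seed=0):
    """Play both arms at random until p-value < alpha, then return the 'winner'.

    Note: testing after every play is repeated peeking, which inflates the
    false-positive rate relative to the nominal alpha.
    """
    rng = random.Random(seed)
    succ, n = [0, 0], [0, 0]
    for _ in range(max_plays):
        arm = rng.randrange(2)                      # explore: 100% random
        n[arm] += 1
        succ[arm] += rng.random() < true_rates[arm]  # simulated binary reward
        if min(n) >= min_per_arm and two_proportion_p_value(succ[0], n[0], succ[1], n[1]) < alpha:
            break
    rates = [succ[i] / max(n[i], 1) for i in range(2)]
    winner = 0 if rates[0] >= rates[1] else 1        # exploit: play greedy from here on
    return winner, sum(n)
```

The two numbers answer different questions: two_proportion_p_value is the probability of data at least this extreme if A and B were equal, while prob_b_beats_a is a posterior probability that depends on the prior you picked.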

This simple post I wrote on agents and online learning might be informative http://mgershoff.wordpress.com/2011/10/30/intelligent-agents...

But don't take my word for it (disclaimer: I work for www.conductrics.com, which provides decision optimization as a service). Take a look at a great intro text on the topic by Sutton & Barto: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.ht...