Comment by conductrics

13 years ago

This is part of a larger class of problems known as reinforcement learning problems. A/B testing, when used for decision optimization, can be thought of (sort of) as just a form of bandit using an epsilon-first approach. You play randomly until you reach some threshold (set by some more or less arbitrary hypothesis test), which is the learning period; then you exploit your knowledge and play the estimated best option. Epsilon-greedy is nice because it tends to work well regardless, and it isn't completely thrown off by drift (nonstationarity of the environment), since it keeps exploring. One heuristic for deciding whether to use a bandit approach is to ask: is the information I will glean perishable or not? For perishable problems the opportunity cost of learning is quite high, since you have less time to recoup your investment in learning (reducing the uncertainty in your estimates). Also, finding the optimal answer in these situations may be less important than just ensuring that you are playing from the set of high-performing actions. We have a couple of blog posts on related issues: http://www.conductrics.com/blog/
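To make the epsilon-first vs. epsilon-greedy distinction concrete, here is a rough Python sketch, not anything taken from a real product. The function names, the `explore_steps` and `epsilon` defaults, and the conversion rates are all made up for illustration, and rewards are assumed to be simple 0/1 conversions.

```python
import random

def greedy_arm(estimates):
    # Index of the arm with the highest estimated reward (ties go to the lowest index).
    return max(range(len(estimates)), key=lambda arm: estimates[arm])

def epsilon_first(estimates, step, rng, explore_steps=200):
    # A/B-test style: play at random during the learning period, then always exploit.
    if step < explore_steps:
        return rng.randrange(len(estimates))
    return greedy_arm(estimates)

def epsilon_greedy(estimates, step, rng, epsilon=0.1):
    # Explore with probability epsilon on every play, so exploration never stops.
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return greedy_arm(estimates)

def run(policy, true_rates, n_steps, seed=0):
    # Simulate 0/1 rewards; true_rates are hypothetical per-arm conversion rates.
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    estimates = [0.0] * len(true_rates)
    total_reward = 0
    for step in range(n_steps):
        arm = policy(estimates, step, rng)
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        # Incremental running mean of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, counts

if __name__ == "__main__":
    rates = [0.04, 0.05, 0.08]  # illustrative values only
    for policy in (epsilon_first, epsilon_greedy):
        print(policy.__name__, run(policy, rates, n_steps=5000))
```

If the rates drift after the learning period, the epsilon-first version is stuck with whatever it estimated in its first `explore_steps` plays, while the epsilon-greedy version keeps sampling every arm and can eventually notice the change.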