Comment by timr

13 years ago

Bandit optimization has been discussed previously on HN:

always better than A/B testing. In particular, bandit approaches take longer to converge (on average), and don't give you reliable ways to know when to stop testing (when all you know is that you're using an approach that's optimal in the limit of a large N, your only guarantee is that things get better as N gets large). These techniques also make assumptions that aren't valid for a lot of web experiments: identical "bandit" distributions that are constant over time. Throw a few choices that are optimal at different times of day/week/month/year at a bandit optimizer, and it'll just happily fluctuate between them.

Also, there's a lot of variation in performance depending on the parameters of your test -- some of which are completely unknowable. So if you want to really learn about this method, you need to read more than a blog post where the author has concluded that bandit optimization is the new pink. For example, here's a pretty readable paper that does an empirical analysis of the various popular bandit algorithms in different paramterizations:

18 comments

timr

Reply

The really scary drawback is what happens if the bandit prefers a suboptimal choice at the same time that you make an independent improvement in your website. Then the bandit is going to add a lot of data for that variation, all of which looks really good for reasons that have nothing to do with what it is supposed to be testing.

This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.

A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.

  • This seems like a trivial thing to fix by presenting optimal with limited noise. Let's say it picks the optimal choice x% of the time (some really high number), and when additional changes are made or automatically detected, this percentage drops. If you pick the next most optimal down the line through all of your options, and make x proportional to the period of time since the last change, it should make it reasonably resistant to this kind of biasing in the first place, and can ramp back up at a reasonable rate.

    Better yet, make x dependent in some way on time since the last change, and relative change in performance of all options from before and after the change.

  • I might not be understanding you correctly, but wouldn't the independent improvement also help the random bandit choices? If you are using the forgetting factor this shouldn't be a real issue.

    My problem with the bandit method is that I want to show the same test choice to the same person every time he sees the page so you can hide that there is a test. If I do this with the bandit algo then it warps the results because different cohorts have different weightings of the choices and differing cohorts behave very differently for lots of reasons.

    • The independent improvement also helps the random bandit choices. The problem is that you are comparing A from a largely new population with Bs that are mostly from an old population. It takes a long time to accumulate enough new Bs to resolve this issue.

      A forgetting factor will help.

      This is a variant of the cohort issue that you're talking about.

      The cohort issue that you're talking about raises another interesting problem. If you have a population of active users, and you want to test per user, you often will find that your test population ramps up very quickly until most active users are in, and then slows down. The window where most users are assigned is a period where you have poor data (you have not collected for long, users have not necessarily had time to go to final sale).

      It seems to me that if you want to use a bandit method in this scenario, you'd be strongly advised to make your fundamental unit the impression, and not the user. But then you can't hide the fact that the test is going on. Whether or not this is acceptable is a business problem, and the answer is not always going to be yes.

Agreed. And I'd add that for me, the point of A/B testing is to learn something. We're not just interested in whether A or B is better; we're interesting in getting better at doing what we do. Studying the A/B results is half the fun.

  • "We're not just interested in whether A or B is better; we're interesting in getting better at doing what we do."

    Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) is to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).

    It's the same human tendency that has led to "you should follow me on twitter" to be one of the more common n-grams on the interwebs (even though it probably never worked for more than Dustin Curtis, and likely causes a counter-intuitive backlash now).

    • Well put. I get why they don't get it, but from my perspective it looks like technical gold-plating a lot of the time. I generally aim to amplify the power of the human mind, not eliminate it from the system.

      Of course, these algorithms are cool in certain contexts where you never want to think about it again. For example, ad placement for a giant ad network running against a lot of UGC content. They remind me of neural networks like that: the post office doesn't care how the machine recognizes hand-written zip codes, just as long as they do it reliably.

      But startups are all about learning, and interface design is where you have direct contact with users. I want to soak in the data. I want to hang out in people's living rooms when they're using my site. A fancy Lazy Button (great phrase) is the last thing I need.

  • Why can't you do the same study from the bandit results? From what I understand, this the same as A/B testing, except it will only show a suboptimal result to 10% of the users instead of 50% (or more). After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?

    The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand, the only difference is that the bandit will be bugging your users with bad results less often.

    • Sure, but wouldn't that take longer to get us our answer, and therefore keep us from moving on to the next experiment sooner?

      Of course, the real answer to why not is "we have an A/B system and I'm not going to add bandit stuff for no benefit". But even if I were doing it from scratch, it seems more complex. The benefit of these approaches seems to be that one no longer has to think. We want to think.

    • "After a few days of testing, can't you take a look at the statistics and analyze them the same way that you would for an A/B test? Then just stop testing?"

      No, because the assumptions that underpin many statistical techniques are violated when you're not assigning people to cohorts consistently, and at random.

      3 replies →

"and don't give you reliable ways to know when to stop testing"

Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against and stop the test at that point regardless of if your result is significant or not.

  • In theory, yes. In practice, no.

    See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation that I did where I actually set up some reasonable fake tests, and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that high before the underlying difference made mistakes incredibly unlikely. Conversely if there is only a small real difference, the amount of data needed before you have a significant chance of having accidentally come to a erroneous conclusion is very, very long.

    So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)

    Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.

  • "Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing."

    Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision:

    http://en.wikipedia.org/wiki/Statistical_power

    You can't do this (at least, not easily) when you're using bandit models, because none of the assumptions are valid.