20 lines of code that beat A/B testing every time

13 years ago (stevehanov.ca)

A stevehanov.ca link? Wow, HN is getting classy again. Please, more articles with code, equations, and/or good visualizations; less upvoting of badly thought out infographics (i.e. pretty numbers which would lose nothing by just being presented in a table); and far fewer self-help pseudo-business articles.

+1 on an article does not mean "I agree". It means "I learnt something".

  • I upvoted you because I agree.

    But I usually upvote stories based on how interesting they are and whether I feel like they were worth attention. Comments are a mixed bag of agreeing, finding things interesting, and just well-thought-out arguments that make me think.

    This is completely meta and kind of offtopic, so it should probably be downvoted, but I felt like saying it anyway.

  • I also +1 a comment where I learn something.

    As in meatspace voting, you should always reward good behavior... be part of the fitness function.

  • This is thought-provoking, which is good. However there are significant issues with the approach.

    1. Real world performance varies over time. For instance there are typically daily, weekly and monthly conversion rate fluctuations. Not an issue for A/B testing, but a big issue for this approach if a random switch in direction happens at the same time that conversion fluctuations happen to head in a good direction.

    2. (This is really a special case of #1 - but a very, very important special case.) This approach creates long-lasting interaction effects between tests and independent changes. That requires explanation. Suppose you're running a test. Version A is somewhat better. But version B is temporarily looking slightly better when you make a significant improvement to your website (maybe you started another test that works). Now you're adding a lot of good traffic to version B (good because of the other change) and very little of the new good traffic to version A. This new version B traffic soundly beats your old version A traffic. This correlation between time and website performance will continue until the old version A traffic is completely swamped by new version A traffic. With only 5% of your traffic going to version A, this can easily take 100x as long as your test has been running - or more. (Properly constructed A/B tests do not suffer this statistical anomaly.)

    3. Code over time gets messy. One of the most important characteristics of A/B testing is that you can delete the mess and move on. With this approach you can't - it just hangs around adding to your technical debt.

    4. Businesses are complex, and often have multiple measures they would like to balance. For instance in a recent test, conversion to click was hurt, but conversion to a person who clicked 5x was helped. A/B testing let us notice that something weird was going on and think about what we really cared about. This automated approach would make a decision and could have hidden a real problem.

    5. Many tests perform differently on existing users and new users. A/B testing with proper cohort analysis can let you tease this out and decide accordingly. This approach doesn't give you that kind of sophistication.

    • These also don't make much sense to me, but to avoid getting d/ved to oblivion I'll say why.

      Firstly, neither you nor the author is really citing any maths or real-world papers or anything. He is backing his up by saying that the advertisers are all using this over A/B, though, which is a pretty strong argument. But it occurs to me that for most of your points to stand you need to tackle this particular paragraph:

      Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method.

      So, to tackle your points (for reference, a minimal epsilon-greedy sketch follows this list):

      1. Only stands if you show his paragraph above to be wrong. Does epsilon-greedy only work on consistent payouts, or does it work on fluctuating payouts too? It would seem to me that this would be a common occurrence in advertising on websites. I imagine there is some research out there to settle this!

      2. He addresses this directly in the post: This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor)

      3. There is no difference between this and A/B testing here: the mock code he shows is supposed to go in your A/B testing framework, and the code in the controllers is supposed to be the same (and you can remove it the same way).

      4. Isn't A/B testing just as bad at testing multiple factors? Why wouldn't you 'notice'? You should theoretically see the same percentages for each stage, and would be able to notice the oddity.

      5. Again this only stands if you show his paragraph above to be wrong. You are suggesting that a complicated strategy will win, which he says isn't true.
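
      For the concrete picture, here is roughly what the epsilon-greedy strategy in that quoted paragraph looks like. This is a minimal Python sketch of my own, not the article's actual code; the option names and bookkeeping are made up for illustration:

        import random

        class EpsilonGreedy:
            """Minimal epsilon-greedy bandit: explore 10% of the time, exploit otherwise."""

            def __init__(self, options, epsilon=0.1):
                self.epsilon = epsilon
                self.trials = {o: 0 for o in options}     # times each option was shown
                self.rewards = {o: 0.0 for o in options}  # total reward (e.g. clicks) per option

            def choose(self):
                if random.random() < self.epsilon:
                    return random.choice(list(self.trials))  # explore
                # exploit: option with the best observed rate so far
                return max(self.trials, key=lambda o: self.rewards[o] / max(self.trials[o], 1))

            def update(self, option, reward):
                self.trials[option] += 1
                self.rewards[option] += reward

        # usage: bandit = EpsilonGreedy(["orange", "green", "white"])
        #        shown = bandit.choose(); ...; bandit.update(shown, 1 if clicked else 0)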

      • I showed that paragraph to be wrong. Not slightly, but wildly in the common scenario where, while a test is running, you make another change to your website that lifts conversion.

        The difference in this case is not a few hundredths of a percentage point of performance. It is a question of potentially drawing the wrong conclusion about which version is better for 100x as long as you need to.

        3 replies →

    • I got a very different impression of this method than you did because I saw it as, just as with an A/B test, when you make a major change to the site, you reset your counters. This makes your point #2 mute. As for number 5, why couldn't you use the same cohort analysis with this method?

      • What is a major change?

        I've seen minor tweaks to a form raise conversions by 20%. The person running a particular test may not even know that a particular change was significant, or even that it happened. The change could be as subtle as another running test deciding that version X is better, thereby interfering with existing tests.

        As for #5, when you run into these situations with an A/B test, you're able to break down and crunch the numbers in multiple ways, and then have a discussion about how you want to proceed. But with a multi-armed bandit approach, whatever complexities you have not thought of and baked into your approach are not going to be noticed.

      • moot.

        mute == "does not talk"

        moot == "of little or no practical value or meaning; purely academic."

        moot/mute is a tricky word combo because smart people can justify using "MUTE" instead of moot, and English is such a terrible language w.r.t. spelling that there are no good clues for using one versus the other (and "moot" is far less commonly used).

        1 reply →

    • It seems to me that A/B (or A/B/C/...) testing as described by btilly, and epsilon-greedy multi-armed bandit optimisation as described by Steve Hanov are two points on a continuum. In A/B/... testing, you're exploring 100% of your traffic, and eventually you declare the test "done", and start exploiting 100% of your traffic (sending it to the "best" slice). You have the advantage that during the exploration phase, your assignment of users to test slices is completely random, uncorrelated with anything else. In the epsilon-greedy approach, you're exploring with 10% of your traffic, and exploiting the current best-looking choice with 90% of your traffic. The trouble is that now 90% of your data is coming from one slice, and which slice it's coming from could vary over time. This means, as Ben points out, that your choice of slice could be 90% correlated with anything else that varies over time.

      One approach would be to ignore the data from the 90% exploitation; that way, you only get 10% of the data, but its slice assignment is completely random and uncorrelated with anything else that might be happening. The trouble is that now you're running an A/B/... test on only 10% of your traffic, which means that it will converge 10x slower than if you were running it on 100% of your traffic.

      However, it seems to me that the extra 90% of data that I've proposed ignoring isn't that useful, because it's only coming from one slice at a time. What you really want is to get more data from the slices you know least about. I suspect there are reinforcement learning algorithms that take into account not just the reward rate for each slice, but the current level of certainty with which the algorithm knows the reward rate, so it can collect more data about the slices it knows the least about, and stop collecting data about the slices for which it already has a fairly accurate reward estimate. The question is, are there such algorithms that can also handle non-stationary reward distributions? And how much tuning and tweaking do they require?
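
      For what it's worth, algorithms like that do exist: the UCB family adds an uncertainty bonus to each slice's estimated reward, so under-sampled slices get chosen more often. Here's a minimal sketch of UCB1, my own illustration rather than anything from the article; non-stationary rewards are usually handled on top of this with a sliding window or a discount factor.

        import math

        def ucb1_choose(trials, rewards):
            """Pick the slice with the highest upper confidence bound.

            trials:  dict slice -> number of times shown
            rewards: dict slice -> total reward observed
            """
            # try every slice once before trusting the bounds
            for s, n in trials.items():
                if n == 0:
                    return s
            total = sum(trials.values())
            def bound(s):
                mean = rewards[s] / trials[s]
                return mean + math.sqrt(2 * math.log(total) / trials[s])
            return max(trials, key=bound)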

      • That data from the 90% on slice X is valuable because you're trying to confirm, or falsify, that X is better, quickly. The strategy is "look closely at this pointy thing to see if it's a needle or a sharp piece of straw."

        1 reply →

    • re: #3... you can also remove items at will. If you look at the counters and find option C is consistently the lowest performer, then you can remove it from the code and remove the counter. It doesn't mean cruft needs to pile up.

      • But how do you know when you can do that? Particularly with the slower convergence issue, and the potential presence of confounding factors like #2?

    • So Ben, looks like it's time to cross swords again :)

      1. Real world performance varies over time...

      I don't really understand this complaint. With A/B testing, you have to collect enough data before making a decision that the conversion rate fluctuations average out. If you're worried about monthly fluctuations (which is reasonable), and you're doing A/B testing, you need to collect data for a few months. This means months where you aren't optimising. With a bandit algorithm conversion rate fluctuations will cause changes in the bandit behaviour, but they will average out over time just like A/B testing. We can prove this, so I don't know what justification there is for the statement that it is "a big issue for this approach". Indeed I think this points out a weakness of A/B approaches. Firstly, it's arguable that tracking fluctuations in conversion rate is a good thing. They may be temporary, but why shouldn't the algorithm react to temporary changes? Secondly, unlike A/B testing you're not spending months collecting data -- you're actually optimising while you collect data. Finally, what site can afford to spend months not changing, while data is collected?

      2. (This is really a special case of #1 - but a very, very important special case...

      I see two issues here: 1) the fact the traffic has changed and 2) how long does it take the system to recognise this change?

      Addressing 1: Getting technical for a moment, A/B tests assume the distribution generating the data is stationary. This situation violates that assumption, so one shouldn't be using A/B testing here. I'd like to know what is meant by a properly constructed A/B test in this case. Does it mean throwing out all the data before the significant improvement? If you say that a non-stationary distribution is fine, then you can develop bandit algorithms for non-stationary situations (e.g. http://arxiv.org/abs/0805.3415). Myna handles non-stationary situations, so long as they change quite slowly.

      Point 2 is really about convergence rates. There is some work here (http://imagine.enpc.fr/~audibert/Mes%20articles/ALT11.pdf) and the upshot is that convergence rate can be a problem in theory though you can design for it. Notably we haven't observed this issue in practice.

      3. Code over time gets messy...

      This is a non-issue in my experience. At some point you decide that an experiment is no longer worthwhile. Either you're redesigning things so it doesn't make sense anymore, or the experiment has been running long enough that you are confident of no more changes. When you hit this point you remove the code. It can be a year or an hour after starting the experiment. Either way, you know that with a bandit algorithm you will have optimised what traffic you did receive.

      4. Businesses are complex, and often have multiple measures they would like to balance...

      The glib answer to this is to set your reward criteria correctly, but more below.

      5. Many tests perform differently on existing users and new users...

      Both of these are really about performing extra analysis on the data. Myna doesn't help here, and we could make it better in this regard by, for example, exporting data. Some of this extra analysis, such as cohort analysis, we can and will automate in the future. There will always be analyses that we can't or don't perform, so there will always be jobs for people who perform these analyses :-) On the other hand, bandit algorithms are used enough in practice (e.g. Google uses them for AdWords) that their utility has been validated. It's a trade-off between time and return. We want to automate the common cases so the analysts can concentrate on the interesting parts.

      • Do you attempt to control for seasonal effects at all in your models? I'm wondering if allowing for a dependence on day-of-week, time-of-day and other relevant covariates, has any benefits over tracking fluctuations in a generic way?

      • In order to do anything in practice, you have to make assumptions. With A/B testing your key assumption is that differences in version behavior are relatively stable over time, even if absolute behavior changes. This is not always true - for instance a version that makes references to Christmas will perform differently before and after Dec 25. But it tends to be true often enough that people can rely on that assumption. (With no simplifying assumptions of any kind you get analysis paralysis - that would be bad.)

        Now to reply point by point.

        1. Real world performance varies over time...

        With the assumption that I provided above, A/B testing merely needs to collect enough data to detect real differences in user behavior. After a few days you may not know what the average long-term rate is, but you have strong evidence that one is better.

        I have absolutely no idea where you got the impression that A/B tests typically run for months, or that you can't touch anything else while it is running. How long they need to run varies wildly (depends on traffic volume, conversion rates, etc), but for many organizations a week is a fairly long test.

        2. (This is really a special case of #1 - but a very, very important special case...

        Contrary to your point, A/B tests DO NOT assume that the distribution is stationary. They make the much weaker assumption that if a relative difference is found, that difference will continue to be true later. (If this assumption is only true 95% of the time, it is still useful...)

        When you're running multiple tests the time-dependence of performance is often extreme. For instance while running test 1, you introduce test 2, it wins, and you adopt it. Now the time period you're running test 1 has 2 conversion bumps of around 5% in the middle. This is not a problem for properly constructed A/B tests. But it is a serious violation of "Myna handles non-stationary situations, so long as they change quite slowly."

        Now what is a properly constructed A/B test? A properly constructed A/B test is one where membership in any particular test version has only random correlations with any other factor that could materially affect performance. Such factors include time, and other running tests. If each user is randomly placed in a test version upon first encountering them, independently of anything else that happens, then you have a well-constructed A/B test. Your versions will be statistically apples/apples even though early and late users are apples/oranges. (A sketch of such an assignment follows at the end of this comment.)

        3. Code over time gets messy...

        The degree to which this will be seen to be an issue depends on personal taste, and how much you're the person who has to dive through templates with a maze of if conditions. If you talk to management, it is almost never a (visible) concern. Programmers on the ground often disagree.

        4. Businesses are complex, and often have multiple measures they would like to balance...

        Getting the business to have the necessary discussions in the abstract is a non-starter in my experience. When you have a concrete example, decisions are easier to make.

        Also (as happened in the test that I saw last week) often the discrepancy is a sign of another problem. A better reward criterion would have resulted in making a good pick, but looking at the discrepancy found an issue that wouldn't have been found otherwise, and we'll hopefully wind up with an even better option that we would not have found any other way.

        5. Many tests perform differently on existing users and new users...

        My point is that the extra analyses are useful. However this is really a secondary or tertiary point since most companies doing A/B testing are not going this extra mile.

        As for AdWords, Google has sufficient volume for a traditional A/B test to converge in minutes. They have so much volume that continuous adaptation can just happen for them. Most businesses are not in such a fortunate place.
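
        To make the "properly constructed" assignment from point 2 concrete: one common way to get membership that is random, stable per user, and independent of time and of other tests is to hash the user id together with the test name. This is a generic sketch of the idea, not code from the article:

          import hashlib

          def assign_variant(user_id, test_name, variants=("A", "B")):
              """Deterministic bucketing: stable per user, independent of time and of
              any other test (because the test name is part of the hash)."""
              digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
              return variants[int(digest, 16) % len(variants)]

          # the same user always sees the same variant for a given test
          assert assign_variant("user-42", "button-color") == assign_variant("user-42", "button-color")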

    • 2. If you change the thing you're testing, you need to restart the test. Otherwise, you'll have a meaningless jumble of results from two separate things.

    • re #1,

      could this be solved with an exponential decay?

      aka, saying that a click from 1 month ago is less valuable than someone clicking today.

      By tweaking the decay you could change how quickly the algorithm will sway when the conversion rate changes.

      EDIT: I just saw that rauljara suggested this below: https://news.ycombinator.com/item?id=4040230
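
      Concretely, the decay idea might look something like this (my own sketch, assuming the simple count/reward bookkeeping from the article; the 0.99 factor is an arbitrary illustration you would tune to how fast you want old clicks to fade):

        def decayed_update(trials, rewards, option, reward, decay=0.99):
            """Shrink all past counts before recording the new result, so a click
            from a month ago carries less weight than one from today."""
            for o in trials:
                trials[o] *= decay
                rewards[o] *= decay
            trials[option] += 1
            rewards[option] += reward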

    • I don't think any of those points are very true.

      • I've been involved with A/B testing for nearly a decade. I assure you that none of these points are in the slightest bit hypothetical.

        1. Every kind of lead gen that I have been involved with and thought to measure has large periodic fluctuations in user behavior. Measure it, people behave differently on Friday night and Monday morning.

        2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.

        3. If you really fire and forget, then crud will accumulate. To get rid of that you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.

        4. Most people do not track multiple metrics on every A/B test. If you don't, you'll never see how it matters. I make that a standard practice, and regularly see it. (Most recently, last week. I am not at liberty to discuss details.)

        5. I first noticed this with email tests. When you change the subject line, you give an artificial boost to existing users who are curious what this new email is. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.

        2 replies →

      • #1 certainly is, particularly for businesses prone to seasonal fluctuations. In local lead-gen (for instance) you see big changes in conversion based on the time of year.

      • Wow, you convinced me...

        Sarcasm aside, I've also experienced all of these issues with real world testing and would be interested in hearing your argument as to why you think this is not the case.

        2 replies →

    This is an interesting technique, but it too has flaws. If there is a period of buzz and excitement surrounding your app, whatever design was most popular at that time will be rewarded accordingly, and accrue a high click-through rate with tens of thousands of cases. If you introduce a new superior design after the period of buzz has gone away, the new design may take a very long time to catch up. Even though it is currently outperforming the old, the few hundred cases that are being added won't be enough to offset the tens of thousands that came before.

    With all metrics, it's important to understand what's actually going into the measure and where it might get tripped up.

    A potential solution might be to add a decay factor, so that the older data carries less weight.

    • Better than a forgetting factor, add a Kalman filter (http://en.wikipedia.org/wiki/Kalman_filter). This way you can trust your "new" data more than really "old" data, etc. The beauty of it is that it only adds three attributes to each data sample.
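
      For the curious, a one-dimensional Kalman filter per option might look roughly like this. This is my own sketch, not anything from the linked article; process_noise and obs_noise are tuning knobs for how fast you expect the true rate to drift and how noisy a single 0/1 observation is:

        class KalmanRate:
            """Track a slowly drifting conversion rate with a scalar Kalman filter."""

            def __init__(self, estimate=0.5, variance=1.0, process_noise=1e-4, obs_noise=0.25):
                self.x = estimate        # current estimate of the conversion rate
                self.p = variance        # uncertainty of that estimate
                self.q = process_noise   # how much the true rate may drift per step
                self.r = obs_noise       # noise of a single 0/1 observation

            def update(self, observation):  # observation is 0 (no conversion) or 1
                self.p += self.q                # predict: uncertainty grows over time
                k = self.p / (self.p + self.r)  # gain: how much to trust the new data
                self.x += k * (observation - self.x)
                self.p *= (1 - k)
                return self.x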

    • I don't think this flaw is real.

      If "whatever design was most popular at that time" has a billion trials and 90% successes, and "a new superior design" has 100 trials and 97% successes (97 out of 100), than the new design is favored by the algorithm. No need to "catch up" to the absolute number of successes.

    • If anyone wants to search the literature, the terminology for this is the non-stationary bandit problem. The classical bandit problem is stationary but there has been plenty of research done into non-stationary variants as well.

    • Nice post. One hypothetical case where it could end up serving the worse design more often is if you have times of the day when your users behave very differently due to time zones, e.g. Europe vs North America.

      Say you have a sports site and test a new soccer oriented layout vs an old baseball heavy one. In the day, the old baseball version wins easily. When NA goes to sleep it would serve up baseball to the Europeans until it loses, then after several hours soccer is the winner. But then it is too late and the Europeans go to sleep and on and on. This is an odd example and assumes equal balance, and the site should really be localized, but you get the point.

      • This is actually pretty straightforward to overcome (and one of the real strengths of the Bayesian approach). Rather than using the direct counts of success for each group, add a prior belief that Americans will favor baseball and non-Americans will favor soccer (you can experiment to determine this number).

        You can now evaluate the results conditioned on each group (american / non-american).
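
        One hedged way to encode that prior belief is Beta pseudo-counts per (segment, variant) pair, plus something like Thompson sampling conditioned on the segment. The numbers below are invented purely for illustration:

          import random

          # prior pseudo-counts: (successes, failures) per (segment, variant)
          priors = {
              ("NA", "baseball"): (30, 70), ("NA", "soccer"): (10, 90),
              ("EU", "baseball"): (10, 90), ("EU", "soccer"): (30, 70),
          }
          counts = {k: [0, 0] for k in priors}  # observed [successes, failures]

          def choose(segment, variants=("baseball", "soccer")):
              """Thompson sampling conditioned on the visitor's segment."""
              def sample(v):
                  a0, b0 = priors[(segment, v)]
                  s, f = counts[(segment, v)]
                  return random.betavariate(a0 + s, b0 + f)
              return max(variants, key=sample)

          def record(segment, variant, converted):
              counts[(segment, variant)][0 if converted else 1] += 1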

        1 reply →

    • At this point you could flush out the inferior options and start anew with just the winning design and the new one.

    • Instead of selecting according to the largest expectation, you can select according to the largest value within a certain number of standard deviations above the calculated expectation. The law of large numbers will apply and you will get tighter and tighter bounds around the expectations. That is a superior and more efficient method used in Monte Carlo simulations, e.g. for AI playing the game of Go.

    This is the most important critique of A/B testing. It far outweighs the traditional hoopla about simultaneous inference and Bonferroni corrections.

    Epsilon-greedy does well on k-armed bandit problems, but in most applications you can likely do significantly better by customizing the strategy to individual users. That's a contextual bandit, and there are simple strategies that do pretty well here too. For instance:

    http://hunch.net/?p=298

    http://hunch.net/~exploration_learning/main.pdf

    http://web.mit.edu/hauser/www/Papers/Hauser_Urban_Liberali_B...

    First up, the sales pitch: we provide bandit optimisation SaaS at Myna: http://mynaweb.com. Now that that's out of the way, let's discuss the article.

    I like the epsilon-greedy algorithm because it's simple to understand and implement, and easy to extend. However, to claim "The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method" is false. The standard measure of performance is called regret. You can think of it as the number of times you choose the sub-optimal choice. It is clear that this grows linearly with e-greedy, as there is a constant probability of exploring. The same is true in A/B testing (you show half the people the suboptimal choice in the data-gathering phase, and then make a decision that you have some probability of getting wrong). A good bandit algorithm has regret that grows logarithmically with time, which is a huge difference! This result holds up in practice as well. If you look at some of Yahoo's papers (John Langford, for example; sorry, no links as I'm writing this while getting the kids ready!) you'll see comparisons to e-greedy where they significantly out-perform it. We've had the same results in our testing.
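
    For readers who want the definition being referenced: regret is usually written as the expected shortfall against always playing the best arm, i.e. (standard notation, not from the article),

      R_T = T \mu^{*} - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{a_t} \right]

    where \mu^{*} is the best arm's conversion rate and \mu_{a_t} is the rate of the arm shown at time t. With a fixed exploration rate epsilon, a constant fraction of trials goes to suboptimal arms, so R_T grows linearly in T, whereas UCB-style algorithms achieve R_T = O(log T).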

    • Yeah, I think the problem here is that trying to be a little bit smart kind of gets you into the space where really you should be doing things a LOT smarter. A/B testing provides data that doesn't require much in the way of brains to interpret and is hard to draw poor conclusions from (beyond treating something as statistically significant that is not). Once you step off into epsilon-greedy, you fall into the whole reinforcement learning space.

      To that end, btw, I think a service like yours is potentially quite valuable!

      • Actually, you kind of are already in the RL space when using A/B testing to make online decisions, you just may not be thinking of it that way. From Sutton & Barto: "Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal." That is exactly what you are doing when applying A/B-style hypothesis testing to inform decisions in an online application. Plus, personally, I think A/B testing is, in a way, much harder to interpret; at least most folks interpret it wrong, which isn't a knock, since it provides a non-intuitive (at least to me ;)) result.

        2 replies →

    • As far as I understand it, an advantage of the epsilon greedy algorithm is that it will relearn the best choice if it changes over time. Now, you could do that with a logarithmically-regretful algorithm as well, but it would take more time to relearn.

    I actually noticed that Google was doing this with my Adwords account a couple of weeks ago.

    I have 2 ads in an Ad Group, and the wording between the two is different by a single word. One ad had over double the clickthru rate than the other one, just because of that single word difference.

    I noticed that Google was serving the two ads about 50% of the time, and was going to shut off the one ad that had the lower CTR, but then I let it go, and the next day, I saw that the more successful ad had almost all the views, and the less successful one was barely being served.

    • Yeah, this is an option in the AdWords interface; it is enabled by default but can be turned off. Google probably discovered that most of their advertisers don't check on their ad results very often, and added the auto-optimization so that they could show the better ads and make more profit even without the user's interaction.

    The 'set and forget' aspect of this is appealing. I've sometimes wondered if you could automate the whole thing, including option generation. If you can define good enough mutation functions you could have your features literally evolve over time, without developer input. You'd need a lot of throughput to get reasonable evolution rates though. Jacking up the mutation rate won't help because really big mutations will break the layout.

    It's almost certainly impracticable, but fun to think about.

    This is part of a larger class of problems known as reinforcement learning problems. A/B testing, when used for decision optimization, can be thought of (sort of) as just a form of bandit using an epsilon-first approach: you play at random until some threshold (using some sort of arbitrary hypothesis test), which is the learning period, then you exploit your knowledge and play the estimated best option. Epsilon-greedy is nice because it tends to work well regardless, and isn't completely affected by drift (non-stationarity of the environment).

    One heuristic for deciding whether to use a bandit approach is to ask: is the information I will glean perishable or not? For perishable problems the opportunity cost to learn is quite high, since you have less time to recoup your investment in learning (reducing the uncertainty in your estimates). Also, finding the optimal answer in these situations may be less important than just ensuring that you are playing from the set of high-performing actions. We have a couple of blog posts on related issues http://www.conductrics.com/blog/
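
    In code, the epsilon-first policy described above might look roughly like this (my own sketch; the fixed 1,000-trial learning budget stands in for whatever hypothesis-test stopping rule you actually use):

      import random

      def epsilon_first_choose(trials, rewards, learning_trials=1000):
          """A/B-style 'epsilon-first' policy: explore uniformly at random for a
          fixed learning budget, then always exploit the estimated best option."""
          if sum(trials.values()) < learning_trials:
              return random.choice(list(trials))  # learning period: play random
          return max(trials, key=lambda o: rewards[o] / max(trials[o], 1))  # exploit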

    There are appropriate solutions to the multi-armed bandit problem, and a wealth of literature out there, however this is not one of those solutions.

    Here's a simple thought experiment to show that this will not 'beat A/B testing every time.' Imagine you have two designs, one has a 100% conversion rate, one has a 0% conversion rate. Simple A/B testing will allow you to pick the winning example. Whereas this solution is still picking the 0% design 10% of the time.

    For some other implementations check out the following links:

    For Dynamic Resampling:

    http://jmlr.csail.mit.edu/papers/volume3/auer02a/auer02a.pdf

    http://www.mynaweb.com

    For Optimal Termination Time:

    http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...

    • OK, in AB testing that same 0% design, you're showing it 50% of the time.

      You seem to be saying "I'll A/B test it just for a little while, then weed out the 0% one, but in the case of this new algorithm, I'll let it run for a long time." That's not exactly fair. Not to mention, both algorithms would allow you to clearly see that the 0% option sucks.

      • But the only way this testing method is superior (at least as explained in the article) is that it automatically adjusts itself. If you're going in and adjusting manually, it sounds like this is — at best — precisely as reliable as A/B testing and subject to the same critique the OP levels at A/B testing.

        2 replies →

    • There's no point in continuing to run A/B testing indefinitely once a winner has been chosen, and likewise there is no point in running this indefinitely either.

      Sure the OP advocates "set and forget", but it seems simply silly to do this indefinitely. Obviously, after a while, you're going to want to check in on the results and prune out low performers.

      Is that not the entire point of all these different testing methodologies to begin with?

      Also, if you were to follow the same "set and forget" strategy with A/B, you'd have a 50% chance of showing the 0% CR design, whereas it's only 10% with this method.

      Seems like it wins out in both cases to me.

    • Possibly pedantic - but it would only show the loser 5% of the time. 10% of the time it picks a random design, including the winner.

    • Obviously epsilon-greedy is not optimal. But I think the point is it's surprisingly effective and probably better than A/B testing in most applications. Most alternatives get very complicated very quickly.

      Even your thought experiment is not conclusive. First, it will take some time to be convinced which variation is better. The lost revenue during the 50/50 experimentation phase may outweigh the long-run benefit if your discount rate is high. This can happen in markets where customer acquisition is slow and expensive (i.e. a niche targeted via AdWords). Second, except in the unrealistic 0% vs 100% case, you could get stuck on wrong choice forever if you pass a statistical significance test by chance. Epsilon-greedy will correct itself over time and is asymptotically guaranteed to make the correct choice most of the time. A/B testing only has this asymptotic guarantee if you spend asymptotically long experimenting, which defeats the whole exercise.

      • Or, get complicated and vary the epsilon based on the recent change in conversions.

        For example, record the conversions from the exploration path vs each individual exploitation attempt. Then, adjust epsilon accordingly, so that it is within a certain percentage of one of your best choices.

    I like this.

    See, we've spent the last year doing an Omniture implementation, using T&T, and frankly, it's been a waste of money and time.

    This isn't a criticism of the T&T tools; it's a criticism of our own internal analytics handling and analysis team, and possibly the implementation guidance we received from a few places.

    You can guess at generalizations from the A/B/N changes you've made and try them again, but practically speaking? Meh. It seems like the learnings from one page are very hard to transfer to another page.

    That's why you don't see posts like "Top ten generalizations about how to make your website better!" Instead you pay "SEO Experts" and "Analytics Ninjas" as consultants and they tell you things like "pages with images convert more, generally, but that may not be the case for your specific website because of blah". Handy. Do you have any more generic and obvious advice that doesn't tangibly translate into practical changes to my website?

    Here's the thing: Running an A/B is easy, but analyzing the results is really hard. Generating some kind of general analytics rules is very hard.

    The reason I like this approach is simple: it's easy. It's easy to implement. It's easy to explain. It's easy to convey to the designers that you're doing a 'throw it at the wall and see what sticks' approach to pages. It's easy to explain to managers why the best page has been picked. You don't need a self-important analytics ninja who will go on endlessly about user segments and how, if you segment the data differently, you can see kittens in the page view statistics.

    Just make lots of pages, and see what happens.

    It's not perfect, but it's enough to get started~ (...and honestly, that's the most important thing when you're working with analytics; it beats the hell out of a 3 month implementation cycle that ends up with... page views <-- to be fair, personal opinion about uselessness of implementation not presently shared by marketing department, who likes the pretty graphs. Whatever they mean.)

    Maybe I'm missing it (it's late), but nowhere in the article does it explain why, 10% of the time, it picks a choice at random ("explores"). In fact, the article basically argues why it's not needed (it self-rights if the wrong choice becomes temporarily dominant). It also doesn't explain why specifically it should be a 10% randomization.

    • A random choice is needed to allow people to give rewards to options other than the dominant. I'm sure this doesn't have to be random -- and I'd be curious to see the logic behind the 10% choice -- but you have to have something that gives the other options a chance.

      Makes me wonder if the 10% number couldn't be changed to something that's a function of the number of rewards total; the longer it runs, the less variation there is and the more confident you are in the choice made.

    • The problem is that, depending on your initial settings and early results, certain settings could get such low payoff estimates that they're never tried at all (let's say one gets to 50% after one trial, and the rest all stay above 75%). You want to make sure that your solver adequately explores all choices.

    • Presumably you would want the randomness to give choices that got to 0% another chance. It seems like a better way could be to use the previous success ratios as a probability density function.

    I think rather than get hung up on e-greedy vs. A/B testing vs. UCB (Bayesian vs. non-Bayesian), it is helpful to first step back and think about the larger problem of online learning as a form of the prediction/control problem. The joint problem is to 1) LEARN (estimate) the values of possible courses of action in order to predict outcomes, and 2) CONTROL the application by selecting the best action for a particular situation.

    I noted elsewhere that A/B can be thought of as an epsilon-first learning approach: play random 100% of the time till p-value < alpha, then play greedy (play the 'winner'). As an aside, it is unclear to me how using p-values is a clearer, easier, or more efficient decision rule for these types of problems. It is almost always misinterpreted as Prob(B>A|Data), the choice of alpha determines the threshold but is arbitrary, and it is often a straw-man default - implicitly biasing your confusion matrix. Not saying that you won't get good results, just that it is not clear that it is a dominant approach.

    This simple post I wrote on agents and online learning might be informative http://mgershoff.wordpress.com/2011/10/30/intelligent-agents...

    But don't take my word for it (disclaimer: I work for www.conductrics.com, which provides decision optimization as a service) take a look at a great intro text on the topic by Sutton & Barto http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.ht...

    I created a tool a few years ago built on a similar strategy, but instead of only showing the best performing variation the chance of each variation showing was based on how well it was converting (so in an a/b/c test with conversion rates of 3%/2%/1% version a would show 1/2 of the time, version b would show 1/3 of the time and version c would show 1/6th of the time).

    There was one major flaw with this strategy though:

    Let's say you're testing a landing page and have had 1000 visitors and version A is converting at 40% while version B is converting at 30%. So it looks like so:

      Version A - 200 / 500 - 40%
      Version B - 150 / 500 - 30%

    A new affiliate comes on board and decides to send 200 visitors to your page from some "Buy 200 visitors for $1" domain redirection service. These visitors are such low quality that they will never ever buy anything and will probably just close the window immediately (or are bots). Now your results look something like this:

      Version A - 200 / 680 - 29.4%
      Version B - 150 / 520 - 28.8%

    And with just 200 visitors some random affiliate has killed all your results. Now you could add filtering and reports based on the affiliate or traffic source but this is more code and more attention you have to pay to the test.

    If you were running a traditional A/B test your results would look like this:

      Version A - 200 / 600 - 33%
      Version B - 150 / 600 - 25%

    And even though the overall conversion rate is lower you can still see version A is better than B.

    The idea is good and I love the idea of auto-optimization, but it does have its flaws, which require more than 20 lines of code to overcome.

    • You might want to look at Boltzmann/softmax if you want to weight the probability of selection as a function of the current estimated value. One tricky bit is figuring out a good setting for the temperature parameter. Another poster alluded to softmax. In my experience it doesn't really perform better than a simple e-greedy approach, but maybe it has worked well for others?
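
      For reference, a minimal softmax selection sketch (my own illustration; the estimates dict and the 0.1 temperature are made up, and lower temperatures behave more greedily):

        import math
        import random

        def softmax_choose(estimates, temperature=0.1):
            """Boltzmann/softmax selection: pick each option with probability
            proportional to exp(estimated value / temperature)."""
            options = list(estimates)
            weights = [math.exp(estimates[o] / temperature) for o in options]
            return random.choices(options, weights=weights, k=1)[0]

        # e.g. softmax_choose({"A": 0.40, "B": 0.30, "C": 0.10})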

    There are better approaches for tackling this problem (with 0 regret asymptotically). You can take a look at the UCB (Upper Confidence Bound) algorithm, and you can do even more if you assume some continuity; e.g. what is commonly done is to assume that the whole distribution comes from a Gaussian process. Many interesting ideas in the literature indeed :)

    ASIDE: I love the idea of combining this kind of approach with a random site generator. You can then totally automate the business, from inception onwards. If incorporated as a company, it's an artificially intelligent artificial person.

    A problem with this (well one of them) is that it could home in on the worst of spammy techniques, like masquerading as a legitimate message, shaking animation, illegal claims and "simple" lies/promises. A solution is to filter these out - one could hand-code specific cases, but a general solution seems as difficult as AI in the first place, and requires external knowledge (such as of infrastructure messages, human visual physiology, laws and their interpretation, the concept of lying). It's hard to gather data (e.g. each lawsuit); but maybe something along the lines of a simple "complaint" button would do (use statistics to discount accidental/abusive clicks).

    • something along the lines of giving the bad behavior a huge negative reward (penalty) and you end up automating that as well :).

    A lot of the reasons for which this algo can optimize in the wrong direction have been covered here. In general terms, I agree with all of it.

    I am a big fan of two techniques that I feel would enhance an approach like the one suggested: FIR filters and Decay.

    Simply put: I believe it is important to have a mechanism through which decisions are not made too quickly. A finite impulse response filter would take care of this very well.

    In addition to that, older measurements should not carry the same importance as nice-fresh data. Who cares what people thought about the buttons (or whatever) a month or two ago? The crowd visiting the site this week could have been affected by solar flares. Maybe they prefer something else.

    Obviously you need enough traffic to be able to use such techniques.

    Not sure it's worth the complexity in all cases.

    Although I can imagine this works very well for ecommerce websites and other things where there is a very obvious single measure of success, like for example:

      * the user clicked the button
      * the user signed up
      * the user put something in their cart
      * the user paid X
    

    In this case, it's easy to create the feedback loop that is required for this testing method.

    However, in the real world, things are not always that simple. What if you want to optimise:

      * the percentage of users that return to the site
      * the time that users spend on your site
      * the number of pages that they view in a session
    

    I'm sure some of these metrics can also be plugged back into the bandit algorithm, but it's a lot more complicated.

    Next step: close the loop.

    Option 1: Let the next set of values (colors of the button in the example) themselves be generated after n runs.

    Option 2: let users modify the css as part of a "customization " feature. I remember they did this before the new BBC site was officially launched a few years ago.

    This epsilon-greedy [1] thing looks just like what my old boss used to tell me: 'trust but verify'. Ohh, I got so sick of those words, but at least now I have an algorithm for it.

    Just set the epsilon = 0.1 (10%) higher or lower depending on your initial (personal) confidence. If your guess was right and your epsilon low, then the overall impact on the 'optimal' solution is negligible, but you have built in a fail-safe in case you were human after all.

    [1] https://en.wikipedia.org/wiki/Multi-armed_bandit

    Of course, if the environment is truly stationary, then the easiest, simplest hack for exploration is to just seed the initial values for each option (A/B/.../Z) with an optimistic guess (i.e. something you know is higher than the true value). Then just make decisions based on the current best estimate. The estimates will be driven down over time to their true values. Not claiming you should do this or that it is optimal or anything, but keep it in mind as a quick hack to solve the problem.
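
    A minimal sketch of that optimistic-initialization hack (my own code, not from the article; the 1.0 seed and the 0.1 step size are arbitrary, and the constant step size is what makes the optimism fade gradually rather than vanish after one observation):

      def optimistic_greedy(options, optimistic_value=1.0, step=0.1):
          """Pure greedy selection with optimistic initial estimates: every option
          starts out looking great, and real data gradually pulls each estimate
          down toward its true rate, which forces some early exploration."""
          estimates = {o: optimistic_value for o in options}

          def choose():
              return max(estimates, key=estimates.get)

          def update(option, reward):
              estimates[option] += step * (reward - estimates[option])

          return choose, update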

    It's funny to see the startup crowd rediscovering age-old techniques... Seen the Amazon.com home page recently? Does it look the same every day for a month?

    I can draw many parallels between this and Genetic Algorithms. There is a probability of choosing the next choice (in a GA, the children), and the most profitable (in a GA, the fittest) choices have the highest probability. The solutions evolve, and the most profitable (fittest, in GA terms) solutions remain.

    How is this different from Genetic Algorithms?

    • A GA is a zeroth-order optimization method. A bandit is a type of decision problem. So, a bandit is a single-state RL problem where one is trying to make decisions in an environment in order to minimize regret. A GA is a general optimization approach for when there is no gradient or second-order info about the problem to use. Take a look at XCS classifiers for an approach that can solve bandit-type problems, but uses GAs to estimate the mappings between features and rewards.

    what?!? No patio11 comment?

    • I upvoted btilly's at the top of the thread, considered saying "FWIW: my opinion is this, particularly #2", and then decided that added very little value.

    • I actually jumped straight to the comments on this article to see if he responded. Not that he's required to, but I do find his thoughts valuable on this topic.

    This data isn't fully valuable unless you know what alternate behaviour the users are performing (instead of clicking the alternate-coloured buttons). They still could be staying on your site, just not following the same funnel. They could be return visitors as well.

    • The example highlights the reward of button pressing, but most likely you would be measuring your "success" differently in your own application. You could add different behavior categories and weight each according to how much it is worth to your product.
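
      For instance, a weighted reward might look like this (a sketch only; the event names and weights are invented, and the resulting number is what you'd feed into the bandit's update step):

        # hypothetical weights: how much each behavior is worth to the product
        REWARD_WEIGHTS = {"clicked": 0.1, "signed_up": 1.0, "added_to_cart": 2.0, "paid": 10.0}

        def reward_for(events):
            """Collapse a visitor's events into a single reward value."""
            return sum(REWARD_WEIGHTS.get(e, 0.0) for e in events)

        # e.g. bandit.update(shown_variant, reward_for(["clicked", "signed_up"]))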

    There's one thing that keeps me concerned: time (actually, the number of responses required to pick out the best option). Please correct me if I'm wrong, but I've got a feeling that this process requires more displays to effectively determine 'the best' option.

    • That's actually easy enough to fix. You trade off between certainty and time by setting the initial confidence values.

      For instance, if you're testing A, B, and C; you can start off with success/total values of 1/1 for each or 100/100 (for extreme values). If you start off with 1/1, a single hit or a single miss will swing the algorithm quickly and heavily in that particular direction; e.g. 1 miss for C results in 1/2, and brings its success rate down from 100% to 50% immediately, giving instant precedence to A and B. Whereas if you used 100/100 to start, a single miss for C would only bring it down to 100/101, letting the algorithm take much longer to "settle," but with far more confidence.

      The trick is in picking a number that suits your needs, e.g. for expensive traffic sources (AdWords) pick smaller numbers to minimize the cost of the experiment, and for cheaper, higher-volume sources use larger numbers because you can afford the extra time to be sure.

    This sounds really interesting, I might have been close to building this before without realizing. :) Seems really easy to setup as well, will be interesting to hear any counter-argument in the comments.

    • Seconded. As a non-mathsy marketing/biz guy, my response was "whoa, sounds great.", but with the caveat that my response to the actual statistical math side was "Derp."

      I'm really looking forward to hearing comments from people With Actual Maths!

    I've made it a habit to read his articles on a regular basis. There's always something thought-provoking that sticks for a long time.

    There's a lot out there beyond A/B testing; most of what people are looking for tends to be multivariate.

    This does require a significantly larger sample size for accuracy though.

    • A larger sample size than A/B? Why?

      • I don't know for sure, but it (intuitively) seems to me that you'd require a larger sample size because only 10% of the time will it explore other options. With A/B testing, you've got equal odds to hit either option, so each step is independent of the last; with this, each step depends on the previous results and thus if you have one option jump out into the lead (for whatever reason) it'd make it less likely that the other gets a good sample.

    I always wondered how the great people I know can do so much better without having to do A/B testing in their own businesses. Sure, they try new things, but they are most certainly not applying an A/B-type algorithm.

    It seems the article mentions something quite important in their algorithm: mostly do what has the most expected value. The people who are great just have a much better way to judge that. They can make a poster with 20 design choices (or 50 or 100) and make an estimate of what would probably work and what probably wouldn't on each one, from the size of poster, to the font sizes and types, whitespace, where to place different elements, graphics choices, etc etc etc. They certainly include a random aspect, but this is the exception rather than being the norm. Mostly what dictates choices is your expected returns on them, and the random aspects are compared with these.

    They do pay exquisite attention to their random choices: "This week I decided to see what would happen if I mixed up the day, date, location, and description rather than have it be in logical order, to see if this engaged people any more" (or: to change any other choice randomly). But it is the exception rather than the norm, and done rarely rather than often. They still pay attention to the results, which helps inform their "expected value" function.

    (I didn't spend much time on the article or the linked papers, so please feel free to correct me if I'm misinterpreting. A rigorous algorithm doesn't have much to do with real-world choices, and we are simply nowhere near having an automated web service to write your copy, regardless of how many users get to give feedback on it. So the whole thing isn't very interesting to me, and the above is just my impression of 'why'.)

    Nothing new here. This is a common technique. Any email newsletter service worth its salt has been doing this for years, and I think many A/B testing tools have as well. A/B testing doesn't mean you'll assign 50% of users to each version.

    There is no evidence in this article that

    1) this approach is better than A/B testing.
    2) it doesn't suffer from the Heisenberg principle.

    Also, the call out box at the top is obviously an ad but it's not marked "Sponsored link", which is really scammy.

    • >Also, the call out box at the top is obviously an ad but it's not marked "Sponsored link", which is really scammy.

      That's because it's a link to a webapp the author built.