Comment by isoprophlex

6 days ago

As one of the comments below the article states, the probabilistic alternative to epsilon-greedy is worth exploring as well. Take the "Bayesian bandit", which is not much more complex but a lot more powerful.

If you crave more bandits: https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...

Just a warning to anyone about to implement this: the choice of algorithm rarely matters in practice. The blog author addresses this obliquely (says that the simplest thing is best most of the time), but doesn't make it explicit.

In my experience, obsessing over the best decision strategy is the biggest honeypot for engineers implementing MAB. Epsilon-greedy is very easy to implement, and you probably don't need anything more. Thompson sampling is a pain in the butt for not much gain.
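For the curious, epsilon-greedy really is only a few lines. A minimal sketch in Python (the arm counts, rewards, and epsilon value below are made-up placeholders):

```python
import random

def epsilon_greedy(counts, rewards, epsilon=0.1):
    """Pick an arm index: explore with probability epsilon, otherwise exploit.

    counts[i]  -- number of times arm i has been pulled
    rewards[i] -- total reward observed from arm i
    """
    if random.random() < epsilon:
        return random.randrange(len(counts))  # explore: uniform random arm
    # exploit: the arm with the highest observed mean reward
    means = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(means)), key=means.__getitem__)

# Illustrative stats for 3 arms; arm 1 has the best observed mean.
counts = [10, 10, 10]
rewards = [2.0, 7.0, 3.0]
arm = epsilon_greedy(counts, rewards)
```

With epsilon=0.1, this picks the best-looking arm 90% of the time and a random arm otherwise; the whole "algorithm" is one if-statement and an argmax.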

  • "Easy to implement" is a good reason to use bubble sort too.

    In a normal universe, you just import a different library, so both are the same amount of work to implement.

    Multiarmed bandit seems theoretically pretty, but it's rarely worth it. The complexity isn't in the numerical algorithm but in the state management.

    * Most AB tests can be as simple as a client-side random() and a log file.

    * Multiarmed bandit means you need an immediate feedback loop, which involves adding database columns, worrying about performance (each render now requires another database read), and so on. Keep in mind the database now has to store AB test outcomes and use them for decision-making, and computing those outcomes is sometimes nontrivial (anything beyond a simple click-through).

    * Long-term outcomes matter more than short-term. "Did we retain a customer" is more important than "did we close one sale."
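For scale, the "client-side random() and a log file" version really is this small. A sketch, where the file name and log fields are invented for illustration:

```python
import json
import random
import time

def assign_variant(user_id, variants=("A", "B"), log_path="ab_test.log"):
    """Assign a variant uniformly at random and append the assignment to a log."""
    variant = random.choice(variants)
    with open(log_path, "a") as f:
        f.write(json.dumps({"ts": time.time(),
                            "user": user_id,
                            "variant": variant}) + "\n")
    return variant

v = assign_variant("user-123")
```

Analysis is then an offline join of this log against whatever outcome data you have, on your own schedule — no live feedback loop, no extra reads per render.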

    In most systems, the benefits aren't worth the complexity. Multiple AB tests also add testing complexity. You want to test three layouts? And three user flows? Now you have nine cases which need to be tested. Add two color schemes? 18 cases. Add 3 font options? 54 cases. That combinatorial growth in test cases is not fun. Fire-and-forget seems great, but in practice it's fire-and-maintain combinatorial complexity.

    And those conversion differences are usually small enough that being on the wrong side of a single AB test isn't expensive.

    Run the test. Analyze the data. Pick the outcome. Kill the other code path. Perhaps re-analyze the data a year later with different, longer-term metrics. Repeat. That's the right level of complexity most of the time.

    If you step up to multiarm, importing a different library ain't bad.

    • Multi-armed bandit approaches do not imply an immediate feedback loop. They also do the best you can do with delayed feedback or with episodic (batch) adjustment.

      So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions. If your response variable is not time-invariant, this is actually best practice.
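A sketch of that interval-based adjustment, assuming binary conversions and Beta(1, 1) priors (the traffic counts below are invented): at each interval, set each arm's traffic share to its estimated probability of being the best arm.

```python
import random

def traffic_split(successes, failures, draws=10_000):
    """Estimate P(arm is best) by Monte Carlo draws from each Beta posterior."""
    wins = [0] * len(successes)
    for _ in range(draws):
        samples = [random.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        wins[max(range(len(samples)), key=samples.__getitem__)] += 1
    return [w / draws for w in wins]

# After one interval: arm A converted 40/1000 users, arm B 55/1000.
split = traffic_split([40, 55], [960, 945])
```

No live feedback loop required — you batch-update the posteriors from your logs once per interval and push the new split out with the next deploy or config change.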

    • Sorry but bubble sort is a terrible example here. You implement a more difficult sorting algorithm, like quicksort, because the benefits of doing so, versus using bubble sort, are in many cases huge. I.e., the juice is worth the squeeze.

      Whereas the comment you’re responding to is rightly pointing out that for most orgs, the marginal gains of using an approach more complex than Epsilon greedy probably aren’t worth it. I.e., the juice isn’t worth the squeeze.

    • You've either missed the point of what I wrote, or you're arguing with someone else.

      I'm talking about the difference between epsilon-greedy vs. a more complex optimization scheme within the context of implementing MAB. You're making arguments about A/B testing vs MAB.

  • Thompson Sampling is trivial to implement, especially with binary rewards. ChatGPT can do it reliably from scratch.
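For example, a Beta-Bernoulli Thompson sampling loop with binary rewards, simulated end to end (the click-through rates below are made up for the simulation):

```python
import random

successes = [1, 1]        # start every arm at a Beta(1, 1) uniform prior
failures = [1, 1]
true_ctr = [0.02, 0.10]   # hypothetical ground truth, used only to simulate

for _ in range(5000):
    # Draw one sample from each arm's posterior and play the argmax.
    samples = [random.betavariate(s, f) for s, f in zip(successes, failures)]
    arm = max(range(len(samples)), key=samples.__getitem__)
    if random.random() < true_ctr[arm]:   # simulated binary reward
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [s + f - 2 for s, f in zip(successes, failures)]
```

The bandit quickly concentrates its pulls on the higher-CTR arm; the entire update rule is incrementing one of two counters.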