Comment by ted_dunning

5 days ago

Multi-armed bandit approaches do not require an immediate feedback loop. They also do the best you can do with delayed feedback or with episodic adjustment.

So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions of traffic sent to each variant. If your response variable is not time-invariant, this is actually best practice.
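
For illustration, here is a minimal sketch of what one such periodic adjustment could look like. It assumes binary (converted / did not convert) rewards and a uniform Beta(1, 1) prior on each variant, neither of which is specified above. The function name `thompson_proportions` and the counts in the usage lines are hypothetical. At each interval you would feed in the cumulative counts observed so far and use the returned shares as the split until the next interval.

```python
import random

def thompson_proportions(successes, failures, n_draws=10_000, seed=None):
    """Estimate, by Monte Carlo, the probability that each arm is best.

    Assumes Bernoulli rewards with a Beta(1, 1) prior per arm (an
    illustrative choice). The returned list sums to 1 and can be used as
    the traffic split for the next interval.
    """
    rng = random.Random(seed)
    n_arms = len(successes)
    wins = [0] * n_arms
    for _ in range(n_draws):
        # One Thompson draw: sample a plausible conversion rate per arm
        # from its Beta posterior, then record which arm came out on top.
        samples = [
            rng.betavariate(1 + s, 1 + f)
            for s, f in zip(successes, failures)
        ]
        best = max(range(n_arms), key=samples.__getitem__)
        wins[best] += 1
    return [w / n_draws for w in wins]

# Hypothetical counts after the first interval: A = 30/1000, B = 45/1000.
proportions = thompson_proportions([30, 45], [970, 955], seed=0)
print(proportions)
```

Because the split is recomputed only at fixed intervals, rewards that arrive late are simply counted in whichever interval they show up. If the response drifts over time, one common variation (again an assumption, not something stated above) is to down-weight or window the older counts before calling the function.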

Having significant experience with bandits in production, I strongly recommend only using them for immediate feedback. If the rewards are at all disconnected from the action, you likely won’t be happy with the results.