Understanding the working of bandit algorithms

Samarth Gupta

In the previous post, we introduced the Multi-Armed Bandit problem and why it is useful. In this post, we will develop an intuition for how to tackle this problem algorithmically and understand the workings of the popular UCB (Upper Confidence Bound) and Thompson Sampling algorithms.

The Multi-Armed Bandit problem presents an exploration-exploitation dilemma to the player. The key questions are:

  • Should one play the seemingly best arm in the next round?
  • Or should one try to obtain rewards from one of the other arms to better decide which arm is actually the best?

Let’s think about some simple strategies to build intuition.

  • Strategy 1: (explore, then exploit) Conduct a trial phase in which you sample each arm 5 times, and then always pick the arm that generated the best average reward in the trial phase. While this idea seems intuitive, there is a chance that the results of the trial phase are inaccurate (since we sample each arm only 5 times), and we end up pulling a sub-optimal arm for the rest of the experiment.
  • Strategy 2: (explore longer, then exploit) One can argue that the length of the trial phase should be increased, to say 50 samples per arm, so that we can judge the best arm with more confidence and are less likely to play a sub-optimal arm in the remaining experiment. However, the longer the trial phase, the more reward the player potentially loses during the trial phase itself, because the sub-optimal arms get pulled many times.
  • Strategy 3: (ε-greedy) In Strategies 1 and 2, once the trial phase is over, the player stops actively considering the alternative arms, which hurts their performance. To overcome this, we need to balance the exploration-exploitation tradeoff in a smarter way, so consider a more active strategy. For instance, the player can decide that in each round they will explore with 10% probability and exploit otherwise. That is, with 90% probability they pick the arm with the highest average reward so far, and with 10% probability they pick one of the remaining arms at random. A key advantage of this strategy is that the player never stops exploring, while 90% of the time they take advantage of the information available to them by exploiting. (A minimal sketch of this strategy follows this list.)
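
To make Strategy 3 concrete, here is a minimal Python sketch of ε-greedy as described above. The function pull_arm is a hypothetical stand-in for the environment that returns a numeric reward for the chosen arm; it is not part of the original discussion. (A common variant explores uniformly over all arms rather than only the non-best ones; the sketch follows the description given here.)

```python
import random

def epsilon_greedy(pull_arm, n_arms, n_rounds, epsilon=0.1):
    # pull_arm(i) is an assumed environment callback returning a numeric reward for arm i.
    counts = [0] * n_arms   # number of times each arm has been pulled
    sums = [0.0] * n_arms   # total reward collected from each arm
    for t in range(n_rounds):
        if t < n_arms:
            arm = t  # pull every arm once so that averages are defined
        else:
            best = max(range(n_arms), key=lambda i: sums[i] / counts[i])
            if random.random() < epsilon:
                # explore: pick one of the remaining arms at random
                arm = random.choice([i for i in range(n_arms) if i != best])
            else:
                arm = best  # exploit: the arm with the highest average reward so far
        reward = pull_arm(arm)
        counts[arm] += 1
        sums[arm] += reward
    return sums, counts
```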

At this point ε-greedy may sound pretty convincing. However, we can still do better. Let’s consider the following situation. Suppose you are at a decision point where you have the following information.

In this situation, Arm 2 seems best, Arm 1 seems pretty bad, and Arm 3 seems close to Arm 2. So if one were to explore, should it be with Arm 1 or Arm 3? It seems that one should explore Arm 3 rather than Arm 1, since its performance is more likely to be close to that of Arm 2. However, ε-greedy does not distinguish between Arm 1 and Arm 3 during exploration and picks one of them at random.

Now let us consider another situation. Suppose after Z samples, we are in the following situation.

Which arm would you pick? In this situation, Arm 1 and Arm 3 have a similar average reward so far, but Arm 1 has been sampled far fewer times than Arm 3. So if we were to give the benefit of the doubt to one of them (when judging its inherent reward-producing capability), it would be Arm 1. Thus, we would like to sample Arm 1 in this case.

Motivated by this, the Upper Confidence Bound (UCB) and Thompson Sampling algorithms were invented. The UCB algorithm constructs an index for each arm, termed the UCB index, of the form Index = average reward + exploration factor. The exploration factor decreases as the number of samples obtained from an arm increases. Therefore, the algorithm tends to select arms

i) if they have been giving a good average reward, or

ii) if they have not been sampled enough (and therefore still have a high exploration factor).
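
As a concrete illustration, below is a minimal sketch of the classical UCB1 rule, where the exploration factor for an arm pulled counts[i] times after t total pulls is sqrt(2 · ln(t) / counts[i]); the text does not fix a particular exploration factor, so this choice is an assumption. As before, pull_arm is a hypothetical environment callback.

```python
import math

def ucb1(pull_arm, n_arms, n_rounds):
    # Index of arm i = (average reward of i) + sqrt(2 * ln(t) / counts[i]).
    # The second (exploration) term shrinks as an arm accumulates samples.
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(n_rounds):
        if t < n_arms:
            arm = t  # pull each arm once so that every index is defined
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull_arm(arm)  # assumed environment callback
        counts[arm] += 1
        sums[arm] += reward
    return sums, counts
```

With this index, an arm is selected either because its average reward is high or because its exploration term is still large, matching points i) and ii) above.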

The Thompson Sampling algorithm achieves a similar effect by asking the question: “Based on the samples observed so far, what is the probability that a given arm is optimal?” In each round, it then selects each arm with probability proportional to this quantity. These algorithms have shown significant improvements over the ε-greedy algorithm, both empirically and theoretically. This is primarily because:

  • UCB and Thompson Sampling account for the fact that some arms may be very sub-optimal, and they sample those arms far fewer times than ε-greedy does.
  • UCB and Thompson Sampling explore more at the beginning (when the exploration term dominates the index for all arms) and reduce exploration over time (when the average reward dominates the index), which is intuitively how a human would approach the problem. In contrast, the exploration rate in ε-greedy always remains at 10%. (A minimal Thompson Sampling sketch follows this list.)
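
For completeness, here is a minimal Thompson Sampling sketch for the common special case of Bernoulli (0/1) rewards with Beta(1, 1) priors; the text does not specify a reward model, so this setup is an assumption. Drawing one plausible mean per arm from its posterior and playing the arm with the largest draw selects each arm with probability equal to its posterior probability of being optimal.

```python
import random

def thompson_bernoulli(pull_arm, n_arms, n_rounds):
    # Beta(1, 1) prior on each arm's unknown success probability.
    successes = [1] * n_arms  # alpha parameters of the Beta posteriors
    failures = [1] * n_arms   # beta parameters of the Beta posteriors
    for _ in range(n_rounds):
        # Draw one plausible mean per arm from its current posterior ...
        samples = [random.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        # ... and play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = pull_arm(arm)  # assumed to return 0 or 1
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```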

I hope this helps in understanding the design of algorithms for Multi-Armed bandits.
