UCB1, Multi-Armed Bandits and Regret

There are $K$ slot machines (arms), each of which churns out rewards according to some unknown probability distribution. Let the mean reward of arm $i$ be $\mu_i$. Letting $\mu^* $ denote the mean reward of the best arm, define:

\[ \Delta_i := \mu^* -\mu_i \]

Consider the first $n$ steps. Define the regret $L(n)$ as the expected loss in reward relative to always playing the best arm. If $T_i(n)$ is the number of times arm $i$ is played in the first $n$ steps, then

\[ L(n) = \sum_{i=1}^K \Delta_i \mathbb{E} [T_i(n)] \]
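As a quick numeric illustration of this decomposition (the arm means and expected pull counts below are made up for the example):

```python
# Illustration of the regret decomposition L(n) = sum_i Delta_i * E[T_i(n)].
# The means and pull counts are invented for the example.
mu = [0.9, 0.8, 0.5]            # mean reward of each arm; arm 0 is optimal
expected_pulls = [940, 40, 20]  # hypothetical E[T_i(n)] after n = 1000 steps

mu_star = max(mu)
deltas = [mu_star - m for m in mu]  # Delta_i = mu* - mu_i
regret = sum(d * t for d, t in zip(deltas, expected_pulls))
print(regret)  # 0.1 * 40 + 0.4 * 20 = 12.0
```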

Let $\{x\}$ be the indicator variable equal to $1$ if $x$ is true, otherwise $0$.

Goal: Minimize regret $L(n)$.
Algorithm (UCB1): First, initialize the counts $n_i := 1$ by playing each arm exactly once in the first $K$ steps. Then, using the sample mean $S_i(t)$ of each arm as an estimate of $\mu_i$, at each step $t$ choose the arm with the highest value of: $$ \lambda_i(t) := S_i(t)+\sqrt{\frac{2\ln t}{n_i}} $$
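Here is a minimal sketch of this index rule in Python; the Bernoulli reward simulator, function name, and parameters are assumptions for illustration, not part of the analysis:

```python
import math
import random

def ucb1(arm_means, n_steps, seed=0):
    """Play UCB1 against Bernoulli arms with the given (unknown to the
    algorithm) means, and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # n_i: number of times arm i was played
    sums = [0.0] * k   # running sum of rewards for arm i

    def pull(i):
        reward = 1.0 if rng.random() < arm_means[i] else 0.0
        counts[i] += 1
        sums[i] += reward

    # Initialization: play each arm once in the first K steps.
    for i in range(k):
        pull(i)

    for t in range(k + 1, n_steps + 1):
        # lambda_i(t) = sample mean + sqrt(2 ln t / n_i)
        index = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                 for i in range(k)]
        pull(max(range(k), key=lambda i: index[i]))

    return counts

print(ucb1([0.9, 0.8, 0.5], n_steps=10_000))
```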

Let's bound $T_i(n)$. To do so, start from the fact that in the first $K$ steps each arm is played exactly once, to initialize the counts $n_i$. Hence,

\[ T_i(n) = 1 + \sum_{t=K+1}^n \{ \mbox{at time $t$, arm $i$ was chosen} \} \]

We can relax this in terms of an arbitrary positive integer $c$ by adding an extra condition: after initialization, arm $i$ can be chosen at most $c - 1$ more times while its count $T_i(t-1)$ is still below $c$, so

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \mbox{at time $t$, arm $i$ was chosen, and $T_i(t-1) \geq c$} \} \]

At time $t$, arm $i$ is chosen only if its index $\lambda_i(t)$ is at least the optimal arm's index $\lambda^* (t)$:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \mbox{$\lambda_i(t) \geq \lambda^* (t)$ and $T_i(t-1) \geq c$} \} \]

The upper bound still holds if we replace the condition by a more probable event, comparing the largest value of $\lambda_i$ against the smallest value of $\lambda^* $ over earlier times $0 < t' < t$:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \mbox{$\max_{t'} \lambda_i(t') \geq \min_{t'} \lambda^* (t')$ and $T_i(t-1) \geq c$} \} \]

Because $T_i(t-1) \geq c$, the range $0 < t' < t$ for arm $i$'s maximum should actually be $c \leq t_1 < t$, while the optimal arm's minimum keeps $0 < t_2 < t$:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \max_{c \leq t_1 < t} \lambda_i(t_1) \geq \min_{0 < t_2 < t} \lambda^* (t_2) \} \]

The upper bound still holds if we replace this max/min event by a sum of indicators over all pairs $(t_1, t_2)$, since the maximum exceeding the minimum implies $\lambda_i(t_1) \geq \lambda^* (t_2)$ for at least one pair:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \sum_{t_1=c}^{t-1} \sum_{t_2=1}^{t-1} \{ \lambda_i(t_1) \geq \lambda^* (t_2) \} \]

Consider when $\{ \lambda_i(t_1) \geq \lambda^* (t_2) \} = 1$. This is possible only if at least one of the following holds (a short contrapositive argument is given after the list):
  1. $S_i(t_1)$ overestimates $\mu_i$ $$ S_i(t_1) \geq \mu_i+\sqrt{\frac{2\ln t}{n_i}} $$
  2. $S^* (t_2)$ underestimates $\mu^* $ $$ S^* (t_2) \leq \mu^* - \sqrt{\frac{2\ln t}{n^*}} $$
  3. the true means $\mu_i$ and $\mu^* $ are too close for the confidence intervals to separate them $$ \mu^* - \sqrt{\frac{2\ln t}{n_i}} < \mu_i+\sqrt{\frac{2\ln t}{n_i}}, \mbox{ or equivalently, } \Delta_i < 2\sqrt{\frac{2\ln t}{n_i}} $$
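To see why (writing the confidence radii as in the cases above), suppose none of the three conditions holds. Then

\[ \lambda^* (t_2) = S^* (t_2) + \sqrt{\frac{2\ln t}{n^*}} > \mu^* \geq \mu_i + 2\sqrt{\frac{2\ln t}{n_i}} > S_i(t_1) + \sqrt{\frac{2\ln t}{n_i}} = \lambda_i(t_1), \]

contradicting $\lambda_i(t_1) \geq \lambda^* (t_2)$.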

By the Chernoff-Hoeffding bound (assuming rewards bounded in $[0,1]$), the probability of each of cases 1 and 2 satisfies

\[ P(\cdots) \leq \exp\bigg\{ -2n_i \bigg( \sqrt{\frac{2\ln t}{n_i}} \bigg)^2 \bigg\} = t^{-4} \]
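Concretely, this is the one-sided Hoeffding inequality for the mean of $n_i$ i.i.d. rewards in $[0,1]$, applied with deviation $a = \sqrt{2\ln t / n_i}$:

\[ P\big( S_i(t_1) \geq \mu_i + a \big) \leq e^{-2 n_i a^2} = \exp\bigg\{ -2 n_i \cdot \frac{2\ln t}{n_i} \bigg\} = t^{-4}, \]

and symmetrically for the underestimate of $\mu^* $ with $n^*$ samples.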

Condition 3 can be ruled out entirely by choosing

\[ c = \bigg\lceil \frac{8\ln n}{\Delta_i^2} \bigg\rceil, \mbox{ so that } n_i \geq c \mbox{ implies } \Delta_i \geq 2\sqrt{\frac{2\ln n}{n_i}} \]
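Unpacking that choice, and using $\ln t \leq \ln n$ for every $t \leq n$:

\[ n_i \geq \frac{8\ln n}{\Delta_i^2} \iff \Delta_i^2 \geq \frac{8\ln n}{n_i} \iff \Delta_i \geq 2\sqrt{\frac{2\ln n}{n_i}} \geq 2\sqrt{\frac{2\ln t}{n_i}}, \]

so once arm $i$ has been played at least $c$ times, condition 3 cannot hold.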

Putting these together and taking expectations: $$ \mathbb{E}[T_i(n)] \leq c + \sum_{t=K+1}^n \sum_{t_1=c}^{t-1} \sum_{t_2=1}^{t-1} 2t^{-4} \leq \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3} $$ Hence, the regret becomes: $$ L(n) \leq \sum_{i:\Delta_i>0} \Delta_i \bigg( \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \bigg) = \sum_{i:\Delta_i>0} \bigg( \frac{8\ln n}{\Delta_i} + \Big(1+\frac{\pi^2}{3}\Big)\Delta_i \bigg), $$ which is logarithmic in $n$.
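The constant term comes from bounding the triple sum: at each step $t$ there are at most $t^2$ pairs $(t_1, t_2)$, so

\[ \sum_{t=K+1}^n \sum_{t_1=c}^{t-1} \sum_{t_2=1}^{t-1} 2t^{-4} \leq \sum_{t=1}^{\infty} t^2 \cdot 2t^{-4} = 2\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{3}, \]

while $c = \big\lceil \frac{8\ln n}{\Delta_i^2} \big\rceil \leq \frac{8\ln n}{\Delta_i^2} + 1$ gives the remaining two terms.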

From Lai & Robbins (1985), we know that for single-parameter bandits, policies that asymptotically satisfy the following for every suboptimal arm $j$ play the optimal arm exponentially more often than any other arm, and no uniformly good policy can beat this logarithmic rate:

\[ \mathbb{E}[T_j(n)] \leq \frac{\ln n}{D(p_j \,\|\, p^* )} \]

where $p_j$ and $p^* $ are the reward distributions of arm $j$ and of the optimal arm, and $D(\cdot \,\|\, \cdot)$ is the KL divergence.
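As a rough numeric comparison under assumed Bernoulli arms with means $0.6$ and $0.5$ (an example choice, not from either paper), the Lai & Robbins constant $1/D(p_j \| p^*)$ multiplying $\ln n$ can be set against the $8/\Delta_j^2$ constant from the UCB1 bound above:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Bern(p) || Bern(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_star, mu_j = 0.6, 0.5   # example arm means (assumed for illustration)
delta = mu_star - mu_j

lai_robbins = 1.0 / kl_bernoulli(mu_j, mu_star)  # multiplies ln n in the lower bound
ucb1_const = 8.0 / delta ** 2                    # multiplies ln n in the UCB1 bound

print(lai_robbins, ucb1_const)  # roughly 49 vs 800
```

The gap reflects that UCB1 achieves the logarithmic rate but not the distribution-dependent constant of asymptotically efficient policies.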

References

  1. Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine learning 47.2-3 (2002): 235-256.

  2. Lai, Tze Leung, and Herbert Robbins. "Asymptotically efficient adaptive allocation rules." Advances in applied mathematics 6.1 (1985): 4-22.