# UCB1, Multi-Armed Bandits and Regret

There are $K$ slot machines, each of which churns out rewards according to some unknown probability distribution. Let the mean reward of machine $i$ be $\mu_i$. Writing $\mu^* $ for the mean reward of the best machine, define the gap:

\[ \Delta_i := \mu^* -\mu_i \]

Consider the first $n$ steps. Define the regret $L(n)$ as the expected loss in reward relative to always playing the best arm. If $T_i(n)$ is the number of times arm $i$ is played in the first $n$ steps, then

\[ L(n) = \sum_{i=1}^K \Delta_i \mathbb{E} [T_i(n)] \]
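As a sanity check, the regret formula is a one-liner. A minimal sketch with hypothetical means and assumed expected pull counts (none of these numbers come from the text):

```python
# Hypothetical 3-arm example: mean rewards and assumed expected pull counts.
mu = [0.9, 0.6, 0.5]                 # mu[0] is mu*, the best mean
delta = [max(mu) - m for m in mu]    # gaps Delta_i = mu* - mu_i
expected_pulls = [880, 70, 50]       # assumed E[T_i(n)] for n = 1000 steps

# L(n) = sum over arms of Delta_i * E[T_i(n)]
regret = sum(d * p for d, p in zip(delta, expected_pulls))
print(regret)  # 0.3*70 + 0.4*50 = 41 (up to floating-point rounding)
```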

Let $\{x\}$ be the indicator variable equal to $1$ if $x$ is true, otherwise $0$.

Let's bound $T_i(n)$. Start with the observation that during the first $K$ steps each arm is played exactly once, to initialize the counts $n_i$. Hence,

\[ T_i(n) = 1 + \sum_{t=K+1}^n \{ \mbox{at time $t$, arm $i$ was chosen} \} \]

We can loosen this bound using an arbitrary integer $c \geq 1$: at most $c$ plays of arm $i$ occur while $T_i(t-1) < c$, and the remaining plays are counted by the sum:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \mbox{at time $t$, arm $i$ was chosen, and $T_i(t-1) \geq c$} \} \]

At time $t$, UCB1 chooses arm $i$ when its upper confidence index $\lambda_i(t) := \bar{x}_i + \sqrt{2\ln t / n_i}$ (the empirical mean plus an exploration bonus) is at least the optimal arm's index $\lambda^* (t)$:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \mbox{$\lambda_i(t) \geq \lambda^* (t)$ and $T_i(t-1) \geq c$} \} \]
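A minimal sketch of the resulting UCB1 loop; the `pull` reward callback and the Bernoulli arms below are assumptions for illustration, not part of the derivation:

```python
import math
import random

def ucb1(pull, K, n):
    """Minimal UCB1 sketch. `pull(i)` is an assumed callback returning a
    reward in [0, 1] for arm i; returns the play counts T_i(n)."""
    counts = [0] * K            # n_i: number of times arm i has been played
    sums = [0.0] * K            # cumulative reward of arm i
    for i in range(K):          # first K steps: play each arm once
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(K + 1, n + 1):
        # index lambda_i(t) = empirical mean + sqrt(2 ln t / n_i)
        best = max(range(K),
                   key=lambda i: sums[i] / counts[i]
                                 + math.sqrt(2 * math.log(t) / counts[i]))
        sums[best] += pull(best)
        counts[best] += 1
    return counts

# Usage with assumed Bernoulli arms of means 0.9, 0.6, 0.5:
random.seed(0)
means = [0.9, 0.6, 0.5]
counts = ucb1(lambda i: float(random.random() < means[i]), K=3, n=2000)
# counts[0] should dominate: the optimal arm is played most often
```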

The upper bound still holds if we replace the condition by a more probable event: the best past index of arm $i$ beats the worst past index of the optimal arm, over $0 < t' < t$:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \mbox{$\max_{t'} \lambda_i(t') \geq \min_{t'} \lambda^* (t')$ and $T_i(t-1) \geq c$} \} \]

Because $T_i(t-1) \geq c$, arm $i$'s count is at least $c$, so its range $0 < t' < t$ can be tightened to $c \leq t' < t$:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \{ \max_{t_1} \lambda_i(t_1) \geq \min_{t_2} \lambda^* (t_2); c \leq t_1 < t \mbox{ and } 0 < t_2 < t \} \]

The upper bound continues to hold if we replace the indicator of this event by a sum of indicators over all pairs $(t_1, t_2)$, i.e. a union bound:

\[ T_i(n) \leq c + \sum_{t=K+1}^n \sum_{t_1=c}^{t-1} \sum_{t_2=1}^{t-1} \{ \lambda_i(t_1) \geq \lambda^* (t_2) \} \]

If $\lambda_i(t_1) \geq \lambda^* (t_2)$, then at least one of the following three cases must hold (writing $\bar{x}_i(s)$ for the empirical mean of arm $i$ after $s$ plays):

1. the optimal arm is underestimated: $\bar{x}^* (t_2) \leq \mu^* - \sqrt{2\ln t / t_2}$
2. arm $i$ is overestimated: $\bar{x}_i(t_1) \geq \mu_i + \sqrt{2\ln t / t_1}$
3. the gap is small: $\mu^* < \mu_i + 2\sqrt{2\ln t / t_1}$

By the Chernoff–Hoeffding bound, the probability of each of cases 1 and 2 satisfies

\[ P(\cdots) \leq \exp\bigg\{ -2n_i \bigg( \sqrt{\frac{2\ln t}{n_i}} \bigg)^2 \bigg\} = \exp\{-4\ln t\} = t^{-4} \]

where $n_i$ is the relevant play count ($t_2$ for case 1, $t_1$ for case 2).

And case 3 can be ruled out entirely by choosing $c$ large enough: if

\[ t_1 \geq c \geq \frac{8\ln n}{\Delta_i^2}, \mbox{ equivalently } \Delta_i \geq 2\sqrt{\frac{2\ln n}{t_1}}, \]

then $\mu^* - \mu_i = \Delta_i \geq 2\sqrt{2\ln t / t_1}$ for every $t \leq n$, so case 3 never holds.
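Putting the pieces together: with $c = \lceil 8\ln n / \Delta_i^2 \rceil$ only cases 1 and 2 survive, each with probability at most $t^{-4}$, and summing over all $t$, $t_1$, $t_2$ recovers the standard UCB1 bound of Auer, Cesa-Bianchi & Fischer (2002):

\[ \mathbb{E}[T_i(n)] \leq \bigg\lceil \frac{8\ln n}{\Delta_i^2} \bigg\rceil + \sum_{t=1}^{\infty} \sum_{t_1=1}^{t} \sum_{t_2=1}^{t} 2\, t^{-4} \leq \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \]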

From Lai & Robbins (1985), we know that for single-parameter bandit families, any uniformly good policy must satisfy, asymptotically,

\[ \mathbb{E}[T_i(n)] \geq \frac{(1 + o(1)) \ln n}{D(p_i \| p^* )} \]

where $p_i$ and $p^* $ are the reward densities of arm $i$ and of the optimal arm, and $D(\cdot \| \cdot)$ is the KL divergence. Policies that match this bound are asymptotically optimal, and under them the optimal arm is played exponentially more often than any suboptimal arm.
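As a concrete illustration, for Bernoulli arms the KL divergence has a closed form, so the Lai & Robbins scale $\ln n / D(p_i \| p^* )$ can be evaluated directly. A sketch with assumed means 0.5 (suboptimal) and 0.9 (optimal):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Bern(p) || Bern(q)), assuming 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Assumed example: suboptimal arm mean 0.5, optimal arm mean 0.9, horizon n.
n = 10_000
scale = math.log(n) / kl_bernoulli(0.5, 0.9)
print(round(scale, 1))  # roughly 18 pulls of the suboptimal arm
```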