### Problem Statement

There are $K$ slot machines, each of which churns out rewards according to some unknown probability distribution. Let the mean reward of slot machine $i$ be $\mu_i$, and let $\mu^* := \max_i \mu_i$ be the mean reward of the best slot machine. Define:

$\Delta_i := \mu^* -\mu_i$

Consider the first $n$ steps. Define the regret $L(n)$ as the expected reward lost relative to always playing the best arm. If $T_i(n)$ is the number of times arm $i$ is played in the first $n$ steps, then

$L(n) = \sum_{i=1}^K \Delta_i \mathbb{E} [T_i(n)]$

Let $\{x\}$ denote the indicator variable, equal to $1$ if the event $x$ holds and $0$ otherwise.

Goal: Minimize regret $L(n)$.
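As a concrete check of the definition, a minimal sketch; the means, counts, and horizon below are made-up numbers, with realized counts standing in for the expectation $\mathbb{E}[T_i(n)]$:

```python
# Hypothetical example: K = 3 machines with made-up means and counts.
mu = [0.3, 0.5, 0.7]               # mean rewards mu_i
mu_star = max(mu)                  # mu*
delta = [mu_star - m for m in mu]  # gaps Delta_i

# Suppose after n = 100 plays the realized counts T_i(n) were:
T = [10, 20, 70]

# L(n) = sum_i Delta_i * E[T_i(n)], with the realized counts
# standing in for the expectation.
L = sum(d * t for d, t in zip(delta, T))
print(L)  # Delta = (0.4, 0.2, 0.0), so L(n) is roughly 8
```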

### Upper Confidence Bound Algorithm

Algorithm UCB1 \cite{auer2002finite}: first, initialize the counts $n_i := 1$ by playing each arm exactly once in the first $K$ steps. Then, at each subsequent step $t$, using the sample mean $S_i(t)$ of each arm as an estimate of $\mu_i$, choose the arm with the highest value of:

$\lambda_i(t) := S_i(t)+\sqrt{\frac{2\ln t}{n_i}}$
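The algorithm above can be sketched in a few lines. This is a minimal illustrative implementation; the `pull` callback and the Bernoulli arms in the example run are assumptions standing in for the unknown reward distributions:

```python
import math
import random

def ucb1(pull, K, n):
    """Sketch of UCB1: pull(i) returns a reward in [0, 1] for arm i.

    Returns the play counts n_i and sample means S_i after n steps.
    """
    counts = [0] * K    # n_i
    means = [0.0] * K   # S_i, running sample means
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1  # first K steps: play each arm once
        else:
            # choose the arm maximizing S_i + sqrt(2 ln t / n_i)
            arm = max(range(K), key=lambda i:
                      means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return counts, means

# Illustrative run on Bernoulli arms with made-up success probabilities.
random.seed(0)
probs = [0.2, 0.5, 0.8]
counts, means = ucb1(lambda i: float(random.random() < probs[i]), K=3, n=5000)
print(counts)  # the best arm (index 2) should dominate the play counts
```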

### Optimality

From \cite{lai1985asymptotically}, we know that for single-parameter bandits, no policy can do better asymptotically than

$\mathbb{E}[T_i(n)] \geq \frac{\ln n}{D(p_i || p^* )}$

for every suboptimal arm $i$, where $p_{\cdot}$ are the respective probability densities of the reward distributions and $D(\cdot || \cdot)$ is the KL divergence. A policy matching this bound plays every suboptimal arm only logarithmically often, and hence plays the optimal arm exponentially more often than the rest; such a policy is asymptotically optimal.
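For Bernoulli rewards the KL divergence has a closed form, so the $\ln n / D$ scale can be evaluated directly. A small sketch, where the arm means $0.5$ and $0.8$ and the horizon $n$ are illustrative assumptions:

```python
import math

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats, assuming 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# ln(n) / D(p_i || p*) for a hypothetical suboptimal arm with mean 0.5
# against an optimal arm with mean 0.8, at horizon n = 10000.
n = 10_000
scale = math.log(n) / kl_bernoulli(0.5, 0.8)
print(round(scale, 1))  # the Lai-Robbins scale of E[T_i(n)]
```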

### Proof of Optimality

Let's bound $T_i(n)$. In the first $K$ steps, each arm is played exactly once to initialize the counts $n_i$. Hence,

$T_i(n) = 1 + \sum_{t=K+1}^n \{ \text{arm } i \text{ was chosen at time } t \}$

We can rewrite this in terms of an arbitrary number $c$ by adding an extra condition: the indicator can contribute at most $c$ ones before $T_i(t-1) \geq c$ holds, so

$T_i(n) \leq c + \sum_{t=K+1}^n \{ \text{arm } i \text{ was chosen at time } t \text{ and } T_i(t-1) \geq c \}$

At time $t$, arm $i$ can be chosen only if its index $\lambda_i(t)$ is at least the optimal arm's index $\lambda^* (t)$:

$T_i(n) \leq c + \sum_{t=K+1}^n \{ \lambda_i(t) \geq \lambda^* (t) \text{ and } T_i(t-1) \geq c \}$

The upper bound still holds if we replace the condition by a more probable event: that the largest value of $\lambda_i$ over past times $0 < t' < t$ exceeds the smallest value of $\lambda^*$:

$T_i(n) \leq c + \sum_{t=K+1}^n \{ \max_{0 < t' < t} \lambda_i(t') \geq \min_{0 < t' < t} \lambda^* (t') \text{ and } T_i(t-1) \geq c \}$

Because $T_i(t-1) \geq c$, the range $0 < t' < t$ for arm $i$ can be tightened to $c \leq t' < t$:

$T_i(n) \leq c + \sum_{t=K+1}^n \{ \max_{t_1} \lambda_i(t_1) \geq \min_{t_2} \lambda^* (t_2); \; c \leq t_1 < t \text{ and } 0 < t_2 < t \}$

The upper bound holds if we relax the condition once more, replacing the max/min event by a sum of indicators over all pairs $(t_1, t_2)$:

$T_i(n) \leq c + \sum_{t=K+1}^n \sum_{t_1=c}^t \sum_{t_2=1}^t \{ \lambda_i(t_1) \geq \lambda^* (t_2) \}$

Consider when $\{ \lambda_i(t_1) \geq \lambda^* (t_2) \} = 1$. This is possible only if at least one of the following holds:
1. $S_i(t_1)$ overestimates $\mu_i$: $$S_i(t_1) \geq \mu_i+\sqrt{\frac{2\ln t}{n_i}}$$
2. $S^* (t_2)$ underestimates $\mu^*$: $$S^* (t_2) \leq \mu^* - \sqrt{\frac{2\ln t}{n^*}}$$
3. the true means $\mu_i$ and $\mu^*$ are closer than the confidence width: $$\mu^* - \sqrt{\frac{2\ln t}{n_i}} < \mu_i+\sqrt{\frac{2\ln t}{n_i}}, \text{ i.e. } \Delta_i < 2\sqrt{\frac{2\ln t}{n_i}}$$

By the Chernoff-Hoeffding bound, the probability of each of cases 1 and 2 is at most

$P(\cdots) \leq \exp\bigg\{ -2n_i \bigg( \sqrt{\frac{2\ln t}{n_i}} \bigg)^2 \bigg\} = t^{-4}$
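The algebra in the exponent ($-2 n_i \cdot 2\ln t / n_i = -4\ln t$) can be sanity-checked numerically; a small sketch over arbitrary values of $n_i$ and $t$:

```python
import math

# Check that exp{-2 n_i (sqrt(2 ln t / n_i))^2} equals t^(-4)
# for several values of n_i and t.
for n_i in (1, 7, 50):
    for t in (2, 10, 1000):
        lhs = math.exp(-2 * n_i * math.sqrt(2 * math.log(t) / n_i) ** 2)
        assert math.isclose(lhs, t ** -4, rel_tol=1e-9)
print("exponent simplification checks out")
```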

And, case 3 can be ruled out entirely by choosing $c \geq \frac{8\ln n}{\Delta_i^2}$: then $n_i \geq c$ implies

$\Delta_i \geq 2\sqrt{\frac{2\ln n}{n_i}} \geq 2\sqrt{\frac{2\ln t}{n_i}} \text{ for all } t \leq n$

Putting this together with $c = \lceil 8\ln n / \Delta_i^2 \rceil$, cases 1 and 2 each contribute at most $t^{-4}$ per pair $(t_1, t_2)$: $$\mathbb{E}[T_i(n)] \leq c + \sum_{t=K+1}^n \sum_{t_1=c}^t \sum_{t_2=1}^t 2t^{-4} \leq \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$ since $\sum_{t \geq 1} t^2 \cdot 2t^{-4} = 2\sum_{t \geq 1} t^{-2} = \frac{\pi^2}{3}$. Hence, summing over the suboptimal arms, the regret becomes: $$L(n) \leq \sum_{i : \mu_i < \mu^*} \Delta_i \bigg( \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \bigg) = \sum_{i : \mu_i < \mu^*} \bigg( \frac{8\ln n}{\Delta_i} + \Big( 1 + \frac{\pi^2}{3} \Big) \Delta_i \bigg)$$
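To get a feel for the magnitudes, a sketch evaluating the dominant $\sum_i 8\ln n / \Delta_i$ term of the regret bound for a hypothetical instance (the arm means and horizon are made-up numbers):

```python
import math

# Dominant term of the regret bound: sum over suboptimal arms of
# 8 ln n / Delta_i, for a hypothetical 3-arm instance.
mu = [0.2, 0.5, 0.8]   # made-up mean rewards
mu_star = max(mu)
n = 5000
bound = sum(8 * math.log(n) / (mu_star - m) for m in mu if m < mu_star)
print(round(bound, 1))  # grows only logarithmically in n
```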