# Bootstrapping in DQN for Exploration

### Bootstrapping

Suppose we want to study a property $\theta$ of a given population. For this, we have a sample $D$ of size $n$ drawn from the population, and an estimator $\hat{\theta}$ that can be computed over the points in $D$.

Using **bootstrapping**, we can compute confidence bounds on $\hat{\theta}$. To do this, we draw $n$ points from $D$ uniformly with replacement to create a resample $D'$, and repeat this many times, computing $\hat{\theta}$ on each resample. Each resample should represent the original population reasonably well, so the spread of these estimates approximates the sampling distribution of $\hat{\theta}$. If we trim the lowest and highest $c$ percentiles of the estimates, the remaining range gives a $(100 - 2c)\%$ confidence interval for $\hat{\theta}$.
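The procedure above can be sketched in a few lines. This is the percentile bootstrap for the sample mean; the function name and the number of resamples are illustrative choices:

```python
import random
import statistics

def bootstrap_ci(data, estimator, num_resamples=2000, c=2.5, seed=0):
    """Percentile bootstrap: resample the data with replacement many times,
    recompute the estimator on each resample, and trim the lowest and
    highest c percentiles of the resulting estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        estimator([rng.choice(data) for _ in range(n)])
        for _ in range(num_resamples)
    )
    lo = estimates[int(num_resamples * c / 100)]              # c-th percentile
    hi = estimates[int(num_resamples * (100 - c) / 100) - 1]  # (100-c)-th
    return lo, hi

# Example: a 95% interval (c = 2.5 on each side) for the sample mean.
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1, 2.0, 2.2]
low, high = bootstrap_ci(data, statistics.mean)
```

Note that no assumption about the population's distribution is needed; the resamples alone drive the interval.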

### Bootstrapped DQN for Exploration

In standard DQN \cite{mnih2013playing}, we explore randomly for some fraction of the time and build up an experience buffer of $(s, a, s', r)$ tuples. Bootstrapped DQN \cite{osband2016deep} modifies the architecture to use multiple $Q$ networks, or heads $(Q_1, Q_2, \cdots, Q_K)$, on top of a common state embedding network.
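A rough sketch of the shared-torso architecture follows. The layer sizes are illustrative, and each component is simplified to a single linear map (the real networks are deep):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, EMBED_DIM, N_ACTIONS, K = 4, 8, 2, 10  # illustrative sizes

# Shared torso: one embedding network used by every head.
W_shared = rng.normal(0.0, 0.1, (STATE_DIM, EMBED_DIM))
# K heads: each maps the shared embedding to its own Q-values.
W_heads = [rng.normal(0.0, 0.1, (EMBED_DIM, N_ACTIONS)) for _ in range(K)]

def q_values(s, head):
    """Forward pass: shared state embedding, then the chosen head."""
    phi = np.tanh(s @ W_shared)   # common state embedding
    return phi @ W_heads[head]    # head-specific Q-values

s = rng.normal(size=STATE_DIM)
estimates = [q_values(s, i) for i in range(K)]  # K value estimates for one state
```

Because only the heads differ, the $K$ estimates share features but disagree on values, which is exactly the diversity the exploration scheme relies on.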

Whatever exploration strategy we use ($\epsilon$-greedy or Thompson sampling), instead of pushing $(s, a, s', r)$ tuples, we push $(s, a, s', r, m)$ tuples to the experience buffer, where $m$ is a **binary mask vector** of length $K$ such that $m_i$ indicates whether this tuple should be used to train head $Q_i$.
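A minimal sketch of the masked buffer push, assuming a Bernoulli mask (the paper also considers other masking distributions, e.g. Poisson; the constants and helper names here are illustrative):

```python
import random

K = 10          # number of heads (an illustrative choice)
MASK_P = 0.5    # Bernoulli parameter for the mask (also illustrative)

def make_mask(k=K, p=MASK_P, rng=random):
    """Draw m: head i trains on this transition iff m[i] == 1."""
    return [1 if rng.random() < p else 0 for _ in range(k)]

replay_buffer = []

def push(s, a, s_next, r):
    # Store the usual transition together with its bootstrap mask m.
    replay_buffer.append((s, a, s_next, r, make_mask()))

push(s=0, a=1, s_next=2, r=1.0)
```

With $p = 0.5$, each head trains on roughly half of the collected transitions, so the heads see overlapping but distinct subsets of the data.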

Note that even though a tuple was generated by acting under a specific $Q$ network, it is used to train several of them. Thus, bootstrapped DQN constructs different approximations to the posterior over $Q^*$ from the same initial data.

The diversity of approximations means that the heads start out trying different actions (because of the diverse random initializations of the $Q_k$), but when some head finds a good state and learns to reach it, some (but not all) of the other heads learn from that experience, because of the bootstrap masks. Eventually, the remaining heads either find other good states or learn the best states discovered by their peers. So the architecture explores well, and once one head achieves the optimal policy, all heads eventually converge to it.
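Putting the pieces together, here is a minimal tabular sketch of this behaviour. The paper's agent is a deep network; the $K$ independent Q-tables here merely stand in for the heads, and all names and hyperparameters are illustrative:

```python
import random

class BootstrappedTabularQ:
    """Tabular sketch of bootstrapped exploration: K Q-tables act as
    heads, one head is sampled per episode, and masked updates keep
    the heads' value estimates diverse."""

    def __init__(self, n_states, n_actions, k=5, alpha=0.1, gamma=0.99, seed=0):
        self.rng = random.Random(seed)
        # Diverse random initialization gives each head a different "prior".
        self.q = [[[self.rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
                   for _ in range(n_states)] for _ in range(k)]
        self.k, self.alpha, self.gamma = k, alpha, gamma
        self.active_head = 0

    def begin_episode(self):
        # Thompson-style exploration: commit to one random head per episode.
        self.active_head = self.rng.randrange(self.k)

    def act(self, s):
        # Act greedily with respect to the episode's active head.
        row = self.q[self.active_head][s]
        return max(range(len(row)), key=row.__getitem__)

    def update(self, s, a, r, s_next, mask):
        # Each head trains only on transitions its mask bit admits.
        for i in range(self.k):
            if mask[i]:
                target = r + self.gamma * max(self.q[i][s_next])
                self.q[i][s][a] += self.alpha * (target - self.q[i][s][a])

agent = BootstrappedTabularQ(n_states=3, n_actions=2)
agent.begin_episode()
action = agent.act(0)
```

Committing to one head for a whole episode (rather than re-sampling every step) is what produces the temporally extended, "deep" exploration the paper emphasizes.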