# Bootstrapped DQN

In standard DQN, the agent explores randomly for some fraction of the time and fills an experience buffer with $(s, a, s', r)$ tuples. Bootstrapped DQN instead uses multiple $Q$ networks, or heads, $(Q_1, Q_2, \cdots, Q_K)$, that share a common state embedding network.
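The shared-embedding, multi-head layout can be sketched as a forward pass in NumPy. This is only an illustrative sketch: the class name `BootstrappedQNetwork`, the single ReLU embedding layer, the linear heads, and all sizes are assumptions, not details taken from the paper.

```python
import numpy as np


class BootstrappedQNetwork:
    """Hypothetical sketch: one shared embedding followed by K linear Q heads."""

    def __init__(self, state_dim, num_actions, num_heads, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Shared state-embedding parameters, used by every head.
        self.w_embed = rng.normal(0.0, 0.1, size=(state_dim, hidden_dim))
        self.b_embed = np.zeros(hidden_dim)
        # One independently initialized weight matrix per head.
        self.w_heads = rng.normal(0.0, 0.1, size=(num_heads, hidden_dim, num_actions))
        self.b_heads = np.zeros((num_heads, num_actions))

    def forward(self, states):
        """Map states of shape (batch, state_dim) to Q-values of shape (batch, K, actions)."""
        embedding = np.maximum(0.0, states @ self.w_embed + self.b_embed)  # ReLU
        # Apply every head to the same embedding.
        return np.einsum("bh,kha->bka", embedding, self.w_heads) + self.b_heads


net = BootstrappedQNetwork(state_dim=4, num_actions=2, num_heads=5)
q_values = net.forward(np.ones((3, 4)))
print(q_values.shape)
```

Because the heads are initialized independently, they produce different $Q$ estimates for the same state even though they read the same embedding.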

Whatever exploration strategy is used ($\epsilon$-greedy or Thompson sampling), each transition is pushed to the experience buffer as an $(s, a, s', r, m)$ tuple rather than an $(s, a, s', r)$ tuple, where $m$ is a **binary mask vector** of size $K$ and $m_i$ indicates whether the tuple should be used to train the head $Q_i$.
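A minimal sketch of pushing masked tuples to a buffer follows. The text above does not say how $m$ is generated; drawing each $m_i$ independently from a Bernoulli distribution is one assumed choice, and `NUM_HEADS`, `MASK_PROB`, and `push` are hypothetical names and values.

```python
from collections import deque

import numpy as np

NUM_HEADS = 5     # K (hypothetical value)
MASK_PROB = 0.5   # assumed Bernoulli parameter for each mask entry

rng = np.random.default_rng(0)
replay_buffer = deque(maxlen=10_000)


def push(state, action, next_state, reward):
    """Store (s, a, s', r, m), where m[i] == 1 means head i trains on this tuple."""
    mask = rng.binomial(1, MASK_PROB, size=NUM_HEADS)
    replay_buffer.append((state, action, next_state, reward, mask))
    return mask


mask = push(state=0, action=1, next_state=2, reward=0.5)
```

With `MASK_PROB = 1.0` every head would see every tuple, so the heads would differ only through their initialization.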

Note that even though a tuple was generated by acting with one specific head, it can be used to train several heads. Thus, bootstrapped DQN constructs different approximations to the posterior over $Q^*$ from the same initial data.
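One way to picture how a single batch trains several heads is a per-head loss in which the mask zeroes out the tuples a head should not see. The sketch below assumes a squared TD error and precomputed per-head targets; the function name `masked_td_loss` and the array shapes are illustrative assumptions.

```python
import numpy as np


def masked_td_loss(q_values, targets, actions, masks):
    """Per-head mean squared TD error over the tuples each head is allowed to use.

    q_values: (batch, K, num_actions) Q estimates from every head
    targets:  (batch, K) assumed precomputed TD targets, one per head
    actions:  (batch,) integer actions taken
    masks:    (batch, K) binary masks m from the buffer
    """
    batch_idx = np.arange(q_values.shape[0])
    q_taken = q_values[batch_idx, :, actions]         # (batch, K)
    squared_error = (q_taken - targets) ** 2
    # Avoid dividing by zero for a head that no tuple in the batch selected.
    counts = np.maximum(masks.sum(axis=0), 1)
    return (masks * squared_error).sum(axis=0) / counts  # (K,)


q_values = np.zeros((2, 3, 2))
q_values[0, :, 1] = 1.0
targets = np.zeros((2, 3))
actions = np.array([1, 0])
masks = np.array([[1, 0, 1], [1, 1, 0]])
loss = masked_td_loss(q_values, targets, actions, masks)
print(loss)  # [0.5 0.  1. ]
```

In this toy batch, only the first tuple has a nonzero error; head 2 never sees it (its mask entry is 0), so its loss stays at zero while the other two heads receive a learning signal.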

This diversity means that the heads start out trying essentially random actions (because each $Q_k$ has a different random initialization). When one head finds a good state and generalizes to it, some (but not all) of the other heads also learn from it, because of the bootstrapping: the masks expose each tuple to only a subset of heads. Eventually, the remaining heads either find other good states or learn the best states found by the others. The architecture therefore explores well, and once one head reaches the optimal policy, all heads eventually reach it too.

## References

Osband, Ian, et al. "Deep Exploration via Bootstrapped DQN." Advances in Neural Information Processing Systems. 2016.

Gaurav, Ashish. "Deep Exploration via Bootstrapped DQN (Summary)." STAT946/StatWiki