Deep Deterministic Policy Gradient

The fundamental problem with DQN is that even though it can handle high dimensional state spaces, it can only handle discrete and low dimensional action spaces. But, a lot of tasks need continuous and high dimensional action spaces. For example, vehicle control is often formulated as needing a steering between $[-1, 1]$, a braking amount between $[0, 1]$ and an acceleration amount between $[0, 1]$.

Before discussing DDPG, let's discuss DPG, or deterministic policy gradient, the theorem on which DDPG is based. Consider an actor critic setup, where the actor holds the policy $\mu(s)$, and has weights $\theta$, while the critic learns to estimate $Q(s, a)$ from any $s, a$.

To quantify the actor's performance, we define an objective:

\[ \begin{align*} J(\mu) &= \mathbb{E}_ s[\mbox{total discounted reward from $s$, doing $a=\mu(s)$ every step}]\\ &\approx \mathbb{E}_ s[Q(s, a)], \mbox{ where $a=\mu(s)$} \end{align*} \]

The actor's updates can be based on the gradient of this objective:

\[ \nabla_\theta J(\mu) \approx \mathbb{E}_ s[\nabla_\theta Q(s, a)] \]

$Q(s, a)$ is the critic output, but we need to find its gradient with respect to the actor weights $\theta$. To do this, we can use the chain rule:

\[ \nabla_\theta J(\mu) \approx \mathbb{E}_ s[\nabla_a Q(s, a) \cdot \nabla_\theta a] = \mathbb{E}_ s[\nabla_a Q(s, a) \cdot \nabla_\theta \mu(s)] \]

For the critic, the updates can be the Bellman updates. Directly using this formulation has 3 problems:

First, the Bellman loss is dependent on a quantity which is itself being updated, i.e. $Q(s, a)$; so it is hard to guarantee convergence.

To solve this problem, the DDPG algorithm doesn't directly use the same networks to compute targets. Instead, there are two actor networks, $\mu$ and $\mu'$, and two critic networks, $Q$ and $Q'$. The target networks $\mu'$ and $Q'$ are slowly tracked towards the main networks $\mu$ and $Q$, by an update like:

\[ \mbox{target weights} \leftarrow \tau \cdot \mbox{main weights} + (1-\tau) \cdot \mbox{target weights}, \mbox{ where $\tau \ll 1$} \]

Second, the state inputs to $\mu$ and $Q$ may be comprised of heterogeneous components; for example, a driving environment's state could consist of both velocity and position components. In this case, the updates can only adapt to the scale of one of the units. This problem can be solved by using batch normalization.

Third, exploration is difficult. In continuous action spaces, it doesn't make sense to take random actions at every step. Rather, it is a better idea to take random actions that are temporally related to the previous few actions. For example, if steering left, steering right, driving straight and braking are the major high level actions we can undertake while choosing continuous driving actions, then it makes sense to drive straight for a while and try steering left for a while, and so on. Hence, DDPG adds a noise to the actor policy that can produce temporally extended actions:

\[ \mu(s) \leftarrow \mu(s)+\mathcal{N} \]

This noise could be an Ornstein-Uhlenbeck process over the policy.