Proximal Policy Optimization

PPO, or proximal policy optimization methods try to do modifications to TRPO ideas, to make them easy to implement, while keeping the reliable performance benefits.

In TRPO, our goal is to:

\[ \mbox{maximize } \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \frac{\pi(a|s)}{q(a|s)} \cdot A_\pi(s, a) \bigg] \mbox{ ; while $D_{KL}(\pi, \tilde{\pi})$ is within some constant $\delta$} \]

The first modification that PPO suggests is to clip the optimization objective. Let the ratio $r$ be defined as:

\[ r = \frac{\pi(a|s)}{q(a|s)} \]

In this modification, we clip the optimization objective from the above, so we can avoid too large updates:

\[ \mbox{maximize } \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \min(r, 1+\epsilon) \cdot A_\pi(s, a) \bigg] \]

Secondly, we can use an adaptive KL penalty coefficient. The TRPO optimization problem, even though constrained in the above formulation, can be unconstrained:

\[ \mbox{maximize } \mathbb{E}_ {s \sim \rho_\pi, a \sim q} [ r \cdot A_\pi(s, a) - \beta \cdot D_{KL}(\pi, \tilde{\pi}) ] \]

In this unconstrained case, we can aim for the $D_{KL}$ term to be closer to a target value $d_t$. If the term is too large, say $>\lambda_1 d_t$, we can reduce it's effect by decreasing $\beta$. Similarly, if the term is too small, say $<\lambda_2 d_t$, then we can boost its effect by increasing $\beta$.

Finally, these objectives can be combined with other well known objectives, like adding a negative term for value function error, or adding a bonus term to encourage exploration.

One way to use this in an actor-critic setting is to estimate advantage by a fixed horizon discounted advantage:

\[ \hat{A}_ t = \delta_ t + (\gamma\lambda)\delta_ {t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1} \mbox{ ; where $\delta_t = r_t+\gamma V(s_{t+1}) - V(s_t)$} \]

Here $\lambda$ is arbitrary. This PPO algorithm would work by running the actor for say $N$ times, with the old policy for $T$ timesteps each, and computing $T$ advantage estimates; and then finally optimizing the objective with Monte Carlo estimation to get the new policy parameters, and then doing this for as many iterations as needed.