PPO, or proximal policy optimization, modifies the ideas behind TRPO to make them easier to implement, while keeping TRPO's reliable performance.

In TRPO, our goal is to:

$\mbox{maximize } \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \frac{\pi(a|s)}{q(a|s)} \cdot A_\pi(s, a) \bigg] \mbox{ subject to } D_{KL}(\pi, \tilde{\pi}) \le \delta$

The first modification that PPO suggests is to clip the optimization objective. Let the ratio $r$ be defined as:

$r = \frac{\pi(a|s)}{q(a|s)}$

In this modification, we clip the probability ratio in the objective above, so the policy gets no extra reward for moving $r$ outside $[1-\epsilon, 1+\epsilon]$; taking the minimum with the unclipped term keeps the estimate pessimistic and avoids too-large updates:

$\mbox{maximize } \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \min\big(r \cdot A_\pi(s, a),\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot A_\pi(s, a)\big) \bigg]$
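As a minimal numpy sketch, the clipped objective for a batch of samples might look like the following (the function name, the $\epsilon = 0.2$ default, and averaging over the batch are illustrative choices, not specified above):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective, averaged over a batch.

    ratio:     pi(a|s) / q(a|s) for each sample
    advantage: advantage estimates A(s, a) for each sample
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Element-wise minimum of clipped and unclipped terms, then batch mean.
    return np.mean(np.minimum(unclipped, clipped))

# A ratio far above 1 + epsilon earns no extra credit for positive advantage:
print(clipped_surrogate(np.array([2.0]), np.array([1.0])))  # 1.2
```

Note the minimum makes the objective a lower bound: a ratio below $1-\epsilon$ with negative advantage is still penalized at its clipped value.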

Secondly, we can use an adaptive KL penalty coefficient. The TRPO optimization problem, constrained in the formulation above, can instead be solved in an unconstrained, penalized form:

$\mbox{maximize } \mathbb{E}_ {s \sim \rho_\pi, a \sim q} [ r \cdot A_\pi(s, a) - \beta \cdot D_{KL}(\pi, \tilde{\pi}) ]$

In this unconstrained case, we can adapt $\beta$ so that the measured $D_{KL}$ stays close to a target value $d_t$. If the divergence is too large, say $>\lambda_1 d_t$, we penalize it more strongly by increasing $\beta$; similarly, if it is too small, say $<\lambda_2 d_t$, we relax the penalty by decreasing $\beta$.
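A sketch of this adaptation rule is below; the function name is hypothetical, and the factor 1.5 and scale 2 are the illustrative values used in the PPO paper:

```python
def update_kl_coef(beta, kl, d_target, factor=1.5, scale=2.0):
    """Adapt the KL penalty coefficient toward a target KL divergence."""
    if kl > factor * d_target:
        # Policy moved too far from the old one: penalize deviation more.
        beta *= scale
    elif kl < d_target / factor:
        # Policy barely moved: relax the penalty.
        beta /= scale
    return beta

# Measured KL well above target => beta is doubled:
print(update_kl_coef(1.0, kl=0.05, d_target=0.01))  # 2.0
```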

Finally, these objectives can be combined with other well-known terms, such as a penalty for value function error (useful when the policy and value function share parameters) or an entropy bonus to encourage exploration.
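Written out in the PPO paper's notation (the coefficients $c_1, c_2$, the entropy term $S$, and the value target $V^{targ}_t$ come from that paper and are not defined above), the combined objective is:

$\mbox{maximize } \mathbb{E}_ t \big[ L^{CLIP}_ t(\theta) - c_1 \big(V_\theta(s_t) - V^{targ}_ t\big)^2 + c_2 \, S[\pi_\theta](s_t) \big]$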

One way to use this in an actor-critic setting is to estimate the advantage with a truncated, fixed-horizon discounted sum of TD errors:

$\hat{A}_ t = \delta_ t + (\gamma\lambda)\delta_ {t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_ {T-1} \mbox{ ; where } \delta_ t = r_ t + \gamma V(s_ {t+1}) - V(s_ t)$

Here $\lambda \in [0, 1]$ trades off bias against variance in the estimate. This PPO algorithm works by running, say, $N$ actors with the old policy for $T$ timesteps each, computing the $T$ advantage estimates for each actor, then optimizing the objective on this $NT$-sample Monte Carlo batch to get the new policy parameters, and repeating for as many iterations as needed.