Trust Region Policy Optimization

Motivation: monotonic improvement of the policy; make every policy update move in a direction that is guaranteed not to make things worse.

Let $\rho_0$ be the distribution of start states. So, for any $s_0 \in \mathbf{S}$, $\rho_0(s_0)$ gives the probability of $s_0$ being a start state. Let's sample some $s_0 \sim \rho_0$. For the remainder of this post, $\tau(s_0, \pi)=(s_0, a_0, s_1, a_1, \cdots)$ represents a trajectory of states and actions, beginning from state $s_0$ and following the policy $\pi$.

Let $\eta(\pi)$ be the expected total discounted reward from a possible start state, following a stochastic policy $\pi$:

\[ \eta(\pi) = \mathbb{E}_ {\tau(s_0, \pi)}[r(s_0)+\gamma\cdot r(s_1) + \gamma^2 \cdot r(s_2)+\cdots] \]
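
As a quick aside, this is the quantity a sampling-based estimate would target: average the discounted return over many trajectories. A minimal sketch in Python, where `sample_rewards` is a hypothetical callable (not defined in this post) that rolls out $\pi$ from $s_0 \sim \rho_0$ and returns the reward sequence:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """r(s_0) + gamma*r(s_1) + gamma^2*r(s_2) + ... for one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_eta(sample_rewards, gamma, n=1000):
    """Monte Carlo estimate of eta(pi): average discounted return over n rollouts."""
    return np.mean([discounted_return(sample_rewards(), gamma) for _ in range(n)])
```
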

And, let's define the advantage function for any $s,a$ as:

\[ A_\pi(s, a) = Q_\pi(s, a)-V_\pi(s) \]

Suppose we update $\pi$ and obtain a new policy $\tilde{\pi}$. Let's compute the expected discounted advantage over the trajectory $\tau(s_0, \tilde{\pi})$: $$ \begin{align*} &\mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \bigg] \\ &= \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ Q_\pi(s_t,a_t)-V_\pi(s_t) \} \bigg] \\ &= \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ r(s_t) + \gamma V_\pi(s_{t+1})-V_\pi(s_t) \} \bigg] \\ &= \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t r(s_t) \bigg] + \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ \gamma V_\pi(s_{t+1})-V_\pi(s_t) \} \bigg] \\ &= \eta(\tilde{\pi}) + \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ \gamma V_\pi(s_{t+1})-V_\pi(s_t) \} \bigg] \\ &= \eta(\tilde{\pi}) + \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \{ \gamma V_\pi(s_1) + \cdots \}- \{ V_\pi(s_0) + \gamma V_\pi(s_1) + \cdots \} \bigg] \\ &= \eta(\tilde{\pi}) - V_\pi(s_0) = \eta(\tilde{\pi}) - \eta(\pi) = \Delta \eta \end{align*} $$ The improvement $\Delta \eta$ in the expected total discounted reward is exactly the expected discounted advantage of $\pi$, accumulated along trajectories generated by $\tilde{\pi}$.

Let's rearrange this result:

\[ \Delta \eta = \mathbb{E}_ {\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \bigg] \]

Since we are following the trajectory $\tau(s_0, \tilde{\pi})$, the term $A_\pi(s, a)$ appears whenever state $s$ is visited and action $a$ is sampled from $\tilde{\pi}$. At every timestep $t$ with $s_t = s$, the advantage term is weighted by $\gamma^t$. So the total weight given to a state $s$ is its discounted visitation frequency:

\[ \rho_{\tilde{\pi}}(s) = P(s_0=s)+\gamma P(s_1=s)+\gamma^2 P(s_2=s) + \cdots \]

Here the probabilities are computed along trajectories generated by $\tilde{\pi}$. To get the weight of a state-action pair, we multiply this quantity by the probability of taking action $a$ in state $s$, which is $\tilde{\pi}(a|s)$. We can then rewrite the result on $\Delta \eta$ without the explicit dependence on $t$:

\[ \Delta \eta = \sum_ {s, a} \rho_{\tilde{\pi}}(s) \cdot \tilde{\pi}(a|s) \cdot A_\pi(s, a) = \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a) \]

This expression is difficult to optimize directly, because the visitation frequencies $\rho_{\tilde{\pi}}(s)$ depend on the new policy $\tilde{\pi}$ that we are still trying to find. Hence, TRPO approximates $\rho_{\tilde{\pi}}(s)$ with $\rho_{\pi}(s)$:

\[ \Delta \eta \approx \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a) = \zeta \mbox{ (say)} \]
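
Because everything above is exact for a small tabular MDP, we can sanity-check both the identity for $\Delta \eta$ and the quality of the approximation $\zeta$ numerically. The sketch below (all sizes and policies are illustrative, and rewards depend only on the state, as assumed in this post) computes $V_\pi$, $Q_\pi$, $A_\pi$ and the discounted visitation frequencies by solving linear systems:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition probabilities
r = rng.uniform(size=S)                      # state-only rewards r(s)
rho0 = np.ones(S) / S                        # start-state distribution rho_0

def policy_quantities(pi):
    """V, Q, A, discounted visitation rho, and eta for a tabular policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)                    # state-to-state kernel
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r)         # V = r + gamma P_pi V
    Q = r[:, None] + gamma * P @ V
    adv = Q - V[:, None]
    rho = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)  # sum_t gamma^t P(s_t = s)
    return V, Q, adv, rho, rho0 @ V

pi_old = rng.dirichlet(np.ones(A), size=S)                       # old policy
pi_new = 0.9 * pi_old + 0.1 * rng.dirichlet(np.ones(A), size=S)  # nearby new policy

_, _, adv_old, rho_old, eta_old = policy_quantities(pi_old)
_, _, _, rho_new, eta_new = policy_quantities(pi_new)

delta_eta = eta_new - eta_old                          # true improvement
exact = np.sum(rho_new[:, None] * pi_new * adv_old)    # identity with rho_{pi~}
zeta = np.sum(rho_old[:, None] * pi_new * adv_old)     # surrogate with rho_pi
print(delta_eta, exact, zeta)  # delta_eta == exact; zeta is close when pi_new ~ pi_old
```
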

Let's try to show that $\zeta$ is indeed a good approximation to $\Delta \eta$. To do so, first define $\bar{A}(s)$ as the expected advantage at state $s$ when the action is chosen by $\tilde{\pi}$: $$ \bar{A}(s) = \sum_a \tilde{\pi}(a|s) \cdot A_{\pi}(s, a) = \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)} [A_{\pi}(s, a)] = \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] $$ Using this definition, $$ \Delta \eta = \sum_s \rho_{\tilde{\pi}}(s) \cdot \bar{A}(s) \mbox{ and } \zeta = \sum_s \rho_{\pi}(s) \cdot \bar{A}(s) $$ In both of these terms, we can reintroduce the dependence on $t$ by expanding the definitions of $\rho_{\tilde{\pi}}(s)$ and $\rho_{\pi}(s)$. Starting from some $s_0$, $\Delta \eta$ is the discounted expected advantage accumulated along a trajectory following $\tilde{\pi}$: $$ \Delta \eta = \sum_s \rho_{\tilde{\pi}}(s) \cdot \bar{A}(s) = \mathbb{E}_{\tau(s_0, \tilde{\pi})} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg] $$ And similarly, $\zeta$ is the discounted expected advantage accumulated along a trajectory following $\pi$: $$ \zeta = \mathbb{E}_{\tau(s_0, \pi)} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg] $$ If $\zeta$ is a good approximation to $\Delta \eta$, the quantity $|\Delta \eta - \zeta|$ should be small. By the triangle inequality, $$ \begin{align*} |\Delta \eta - \zeta| &\leq \sum_{t=0}^\infty \gamma^t \cdot |\mathbb{E}_{\tau(s_0, \tilde{\pi})}[\bar{A}(s_t)] - \mathbb{E}_{\tau(s_0, \pi)}[\bar{A}(s_t)]| \end{align*} $$

To simplify this, we need an assumption. Suppose that, at any given state $s$, when we draw $a \sim \pi(\cdot|s)$ and $\tilde{a} \sim \tilde{\pi}(\cdot|s)$ as a coupled pair, they disagree with probability at most $\alpha$:

\[ P(a \neq \tilde{a} | s) \leq \alpha \]
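
One concrete way to realize such an $\alpha$: if the two action distributions at a state are coupled maximally, the probability that the sampled actions differ equals their total variation distance (this is also the link to the divergence measures used at the end of this post). A tiny numeric illustration with made-up probabilities:

```python
import numpy as np

p_old = np.array([0.5, 0.3, 0.2])   # pi(.|s), made-up numbers
p_new = np.array([0.4, 0.4, 0.2])   # pi~(.|s)

# Under a maximal coupling of the two distributions, P(a != a~ | s) equals the
# total variation distance, so this value is a valid per-state alpha.
tv = 0.5 * np.abs(p_old - p_new).sum()
print(tv)   # 0.1
```
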

Let $n_t$ be the number of timesteps before timestep $t$ at which $\pi$ and $\tilde{\pi}$ disagree. Since the coupled actions agree with probability at least $1-\alpha$ at each of those timesteps,

\[ P(n_t > 0) = 1 - P(n_t=0) \leq 1- (1-\alpha)^t \]

Let's try to bound $|\bar{A}(s)|$: $$ \begin{align*} \bar{A}(s) &= \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] \\ &= \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] - \mathbb{E}_{a} [A_{\pi}(s, a)] \mbox{ ; $a \sim \pi(\cdot|s), \tilde{a} \sim \tilde{\pi}(\cdot|s)$} \end{align*} $$ We can subtract this extra term because it is zero: in expectation over actions drawn from $\pi$, the $Q$ function equals the value function, $$ \mathbb{E}_{a} [A_{\pi}(s, a)] = \mathbb{E}_{a} [Q_{\pi}(s, a)]-V_{\pi}(s) = V_\pi(s) - V_\pi(s) = 0 $$ Continuing with the bound on $|\bar{A}(s)|$, and coupling $a$ and $\tilde{a}$ so that they disagree with probability at most $\alpha$, $$ \begin{align*} \bar{A}(s) &= \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] - \mathbb{E}_{a} [A_{\pi}(s, a)] \\ &= \mathbb{E}_{a,\tilde{a}} [A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a)] \\ &= P(a \neq \tilde{a} | s) \cdot \mathbb{E}_{a,\tilde{a}} [A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a) \,|\, a \neq \tilde{a}, s] + P(a = \tilde{a} | s) \cdot 0 \\ |\bar{A}(s)| &\leq \alpha \cdot |\mathbb{E}_{a,\tilde{a}} [A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a) \,|\, a \neq \tilde{a}, s]| \\ &\leq \alpha \cdot 2 \max_{s, a} |A_\pi(s, a)| = 2\alpha\epsilon \mbox{, where } \epsilon = \max_{s, a} |A_\pi(s, a)| \end{align*} $$

Then, in our original difference:

$$ \begin{align*} |\Delta \eta - \zeta| &\leq \sum_{t=0}^\infty \gamma^t \cdot |\mathbb{E}_{\tau(s_0, \tilde{\pi})}[\bar{A}(s_t)] - \mathbb{E}_{\tau(s_0, \pi)}[\bar{A}(s_t)]| \\ \end{align*} $$

Note that each of the expectations can be broken into two cases: either the coupled policies never disagree before time $t$ ($n_t = 0$), or they disagree at least once ($n_t > 0$). When $n_t = 0$ the two trajectories are identical, so those contributions cancel. What remains is the case $n_t > 0$:

$$ \begin{align*} |\Delta \eta - \zeta| &\leq \sum_{t=0}^\infty \gamma^t \cdot P(n_t > 0) \cdot |\mathbb{E}_{\tau(s_0, \tilde{\pi})}[\bar{A}(s_t) \,|\, n_t > 0] - \mathbb{E}_{\tau(s_0, \pi)}[\bar{A}(s_t) \,|\, n_t > 0]| \\ &\leq \sum_{t=0}^\infty \gamma^t \cdot [1- (1-\alpha)^t] \cdot (2\cdot 2\alpha \epsilon) \\ &= \sum_{t=0}^\infty 4\alpha\epsilon\cdot [\gamma^t-\gamma^t(1-\alpha)^t]\\ &= 4\alpha\epsilon \bigg\{ \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} \bigg\} = \frac{4\alpha\epsilon\cdot \alpha\gamma}{(1-\gamma)(1-\gamma+\alpha\gamma)} \\ &\leq \frac{4\alpha^2\epsilon\gamma}{(1-\gamma)^2} = c_0 \cdot\alpha^2 \mbox{ (say)} \end{align*} $$
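
As a quick sanity check on the geometric-series step above, the truncated sum and the closed form can be compared numerically (the values of $\gamma$ and $\alpha$ are arbitrary):

```python
# Truncated sum vs. closed form of sum_t gamma^t [1 - (1 - alpha)^t].
gamma, alpha = 0.95, 0.1
partial = sum(gamma**t * (1 - (1 - alpha)**t) for t in range(10_000))
closed = alpha * gamma / ((1 - gamma) * (1 - gamma + alpha * gamma))
print(partial, closed)   # the two agree up to truncation error
```
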
The main argument behind TRPO comes from relating $\alpha$, the probability that $\pi$ and $\tilde{\pi}$ disagree, to more convenient divergence measures: the maximum (over states) total variation divergence between the two policies can play the role of $\alpha$, and by Pinsker's inequality the squared total variation divergence is bounded by the KL divergence, so $\alpha^2$ can be bounded in terms of $D_{KL}(\pi, \tilde{\pi})$. The bound we just obtained then tells us that: $$ |\Delta\eta - \zeta| \leq c_0 \cdot \alpha^2 \mbox{, or } \Delta \eta \geq \zeta - c_0 \cdot \alpha^2 \geq \zeta - c_1 \cdot D_{KL}(\pi, \tilde{\pi}) $$ Hence, if we maximize $\zeta - c_1 \cdot D_{KL}(\pi, \tilde{\pi})$ at every step, we maximize a lower bound on $\Delta \eta$, which guarantees monotonic improvement. Our optimization problem is therefore $$ \mbox{maximize } \zeta - c_1 \cdot D_{KL}(\pi, \tilde{\pi}) $$ In the practical algorithm, the penalty is replaced by a constraint: we maximize an empirical estimate of $\zeta$ subject to $D_{KL}(\pi, \tilde{\pi}) \leq \delta$ for some small constant $\delta$.
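
The paper solves this constrained problem with a linear approximation of the objective and a quadratic approximation of the KL constraint (a natural-gradient step computed with conjugate gradients, followed by a line search). Purely to illustrate the constrained problem itself, here is a small tabular sketch that hands the surrogate and the KL constraint to a generic solver; the advantages, visitation weights, and $\delta$ are stand-ins, not quantities from the paper:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
S, A, delta = 4, 3, 0.01                       # delta: trust-region radius (assumed)
pi_old = rng.dirichlet(np.ones(A), size=S)     # old policy pi(a|s)
adv = rng.normal(size=(S, A))                  # stand-in for A_pi(s, a)
rho = rng.uniform(size=S); rho /= rho.sum()    # stand-in for rho_pi(s), normalized

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neg_zeta(theta):
    """Negative surrogate: -sum_s rho(s) sum_a pi~(a|s) A(s, a)."""
    pi_new = softmax(theta.reshape(S, A))
    return -np.sum(rho[:, None] * pi_new * adv)

def kl_slack(theta):
    """Feasible iff the rho-weighted mean KL(pi || pi~) is at most delta."""
    pi_new = softmax(theta.reshape(S, A))
    kl = np.sum(pi_old * np.log(pi_old / pi_new), axis=1)
    return delta - rho @ kl

theta0 = np.log(pi_old).ravel()                # start the search at the old policy
res = minimize(neg_zeta, theta0, method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': kl_slack}])
pi_improved = softmax(res.x.reshape(S, A))     # proposed new policy
```
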

In practice, $\zeta$ is used in the following form, with an importance sampling estimate over actions:

\[ \begin{align*} \zeta &= \mathbb{E}_{\tau(s_0, \pi)} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg] \\ &= \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a|s)\cdot A_\pi(s, a) \\ &= \sum_s \sum_a \rho_\pi(s) \cdot \tilde{\pi}(a|s)\cdot A_\pi(s, a) \\ &= \sum_s \sum_a \rho_\pi(s) \cdot q(a|s) \cdot \bigg\{ \frac{\tilde{\pi}(a|s)}{q(a|s)} \bigg\} \cdot A_\pi(s, a) \\ &\equiv \sum_s \sum_a \rho_\pi(s) \cdot q(a|s) \cdot \bigg\{ \frac{\tilde{\pi}(a|s)}{q(a|s)} \bigg\} \cdot Q_\pi(s, a) \\ &= \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \frac{\tilde{\pi}(a|s)}{q(a|s)} \cdot Q_\pi(s, a) \bigg] \end{align*} \]

Replacing $A_\pi(s, a)$ with $Q_\pi(s, a)$ only shifts the objective by a constant that does not depend on $\tilde{\pi}$ (since $\sum_a \tilde{\pi}(a|s) \cdot V_\pi(s) = V_\pi(s)$), so the maximizer is unchanged.
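
With logged data, this expectation becomes a simple average of likelihood ratios times $Q$ estimates. A minimal sketch, with a made-up batch format (tuples of state, action, the behaviour probability $q(a|s)$, and a rollout estimate of $Q_\pi(s, a)$):

```python
import numpy as np

def zeta_estimate(batch, pi_new_prob):
    """Importance-sampled surrogate: mean of pi~(a|s)/q(a|s) * Q_hat(s, a).

    `batch` holds (s, a, q_prob, q_value) tuples and `pi_new_prob(s, a)` returns
    pi~(a|s); both are illustrative names, not part of the paper's code.
    """
    ratios = np.array([pi_new_prob(s, a) / q_prob for s, a, q_prob, _ in batch])
    q_vals = np.array([q_value for _, _, _, q_value in batch])
    return float(np.mean(ratios * q_vals))
```
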

The paper lists two ways of building a Monte Carlo approximation of the optimization objective and constraint:
  1. Single Path
    Sample a set of start states from $\rho_0$, generate a trajectory from each by following the old policy for up to $T$ timesteps, and use the old policy itself as the proposal $q(a|s)$ (a rough sketch of this collection loop appears after the list).

  2. Vine
    Generate a trajectory $\tau$ from some start state using the old policy, and choose a subset of states in $\tau$. For each of these states, sample $K$ actions from a proposal distribution $q(a|s)$, and for each resulting $(s, a)$ pair perform a rollout to estimate $Q_\pi(s, a)$. With this data, we can compute the estimate of $\zeta$.
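
For the single-path scheme mentioned above, data collection is just an ordinary rollout under the old policy. A rough sketch, assuming a Gym-style `env` and a `pi_old` object that can sample an action and report its probability (both interfaces are assumptions, not the paper's code):

```python
def collect_single_path(env, pi_old, T):
    """Roll out the old policy for up to T steps; q(a|s) = pi_old(a|s) here."""
    s = env.reset()
    batch = []
    for _ in range(T):
        a, prob = pi_old.sample_with_prob(s)   # a ~ pi_old(.|s), prob = pi_old(a|s)
        s_next, reward, done, _ = env.step(a)
        batch.append((s, a, prob, reward))     # Q estimates come from rewards-to-go
        if done:
            break
        s = s_next
    return batch
```
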

References

  1. Schulman, John, et al. "Trust Region Policy Optimization." International Conference on Machine Learning (ICML). 2015.
  2. Kakade, Sham, and John Langford. "Approximately Optimal Approximate Reinforcement Learning." International Conference on Machine Learning (ICML). 2002.