Motivation: monotonic improvement of the policy; make every policy update a guaranteed step in the right direction

Let $\rho_0$ be the distribution of start states. So, for any $s_0 \in \mathbf{S}$, $\rho_0(s_0)$ gives the probability of $s_0$ being a start state. Let's sample some $s_0 \sim \rho_0(s_0)$. For the remainder of this post, $\tau(s_0, \pi)=(s_0, a_0, s_1, a_1, \ldots)$ represents a trajectory of states and actions, beginning from state $s_0$ and following $\pi$.

Let $\eta(\pi)$ be the expected total discounted reward, starting from $s_0 \sim \rho_0$ and following a stochastic policy $\pi$:

$\eta(\pi) = \mathbb{E}_ {\tau(s_0, \pi)}[r(s_0)+\gamma\cdot r(s_1) + \gamma^2 \cdot r(s_2)+\cdots]$
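As a minimal illustration, the quantity inside this expectation, the discounted return of one sampled trajectory, can be computed as:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r(s_t) over one sampled trajectory of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```

Averaging this over many sampled trajectories gives a Monte Carlo estimate of $\eta(\pi)$.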

And, let's define the advantage function for any $s,a$ as:

$A_\pi(s, a) = Q_\pi(s, a)-V_\pi(s)$

Suppose we update $\pi$ and obtain a new policy $\tilde \pi$. Let's compute the expected discounted advantage over the trajectory $\tau(s_0, \tilde{\pi})$: \begin{align*} &\mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \bigg] \\ &= \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ Q_\pi(s_t,a_t)-V_\pi(s_t) \} \bigg] \\ &= \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ r(s_t) + \gamma V_\pi(s_{t+1})-V_\pi(s_t) \} \bigg] \\ &= \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t r(s_t) \bigg] + \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ \gamma V_\pi(s_{t+1})-V_\pi(s_t) \} \bigg] \\ &= \eta(\tilde{\pi}) + \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t \{ \gamma V_\pi(s_{t+1})-V_\pi(s_t) \} \bigg] \\ &= \eta(\tilde{\pi}) + \mathbb{E}_{\tau(s_0, \tilde{\pi})}\bigg[ \{ \gamma V_\pi(s_1) + \cdots \}- \{ V_\pi(s_0) + \gamma V_\pi(s_1) + \cdots \} \bigg] \\ &= \eta(\tilde{\pi}) - V_\pi(s_0) = \eta(\tilde{\pi}) - \eta(\pi) = \Delta \eta \end{align*} So the improvement $\Delta \eta$ in the expected total discounted reward is exactly the expected discounted advantage, accumulated along trajectories of the new policy.
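This identity can be verified numerically on a toy MDP (made-up transition and reward numbers, not from the paper; the code solves for $V_\pi$, $Q_\pi$, and $A_\pi$ exactly, then propagates the state distribution of $\tilde{\pi}$):

```python
import numpy as np

# A made-up two-state, two-action MDP. Reward depends only on the
# state, matching the derivation above. Start state is fixed at s0 = 0.
gamma = 0.9
r = np.array([1.0, 0.0])                       # r(s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
pi     = np.array([[0.7, 0.3], [0.4, 0.6]])    # pi(a|s)
pi_new = np.array([[0.6, 0.4], [0.5, 0.5]])    # tilde-pi(a|s)

def state_values(policy):
    P_pol = np.einsum('sa,sax->sx', policy, P)  # P(s'|s) under the policy
    # Solve the Bellman equation V = r + gamma * P_pol @ V exactly
    return np.linalg.solve(np.eye(2) - gamma * P_pol, r)

V = state_values(pi)
Q = r[:, None] + gamma * P @ V                 # Q_pi(s, a)
A = Q - V[:, None]                             # A_pi(s, a)

# Expected discounted advantage along trajectories of tilde-pi from s0 = 0,
# computed by propagating the state distribution (truncated at T = 500):
P_new = np.einsum('sa,sax->sx', pi_new, P)
d = np.array([1.0, 0.0])                       # distribution of s_t at t = 0
expected_adv = 0.0
for t in range(500):
    expected_adv += gamma**t * (d[:, None] * pi_new * A).sum()
    d = d @ P_new                              # advance one step

delta_eta = state_values(pi_new)[0] - V[0]     # eta(tilde-pi) - eta(pi)
print(abs(delta_eta - expected_adv))           # essentially zero
```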

Let's rearrange this result:

$\Delta \eta = \mathbb{E}_ {\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \bigg]$

Since the expectation is over the trajectory $\tau(s_0, \tilde{\pi})$, the term $A_\pi(s, a)$ appears whenever state $s$ is visited and action $a$ is sampled from the stochastic policy, and at timestep $t$ it is weighted by $\gamma^t$. For each state, we can therefore collect its total discounted weight:

$\rho_{\tilde{\pi}}(s) = P(s_0=s)+\gamma P(s_1=s)+\gamma^2 P(s_2=s) + \cdots$

However, to get the discounted frequency of a state-action pair, we need to multiply this quantity by the probability of taking $a$ in $s$, which is $\tilde{\pi}(a|s)$. Then, we can rewrite the result on $\Delta \eta$ without the explicit dependence on $t$:

$\Delta \eta = \sum_ {s, a} \rho_{\tilde{\pi}}(s) \cdot \tilde{\pi}(a|s) \cdot A_\pi(s, a) = \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a)$
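A small sketch of what $\rho_{\tilde{\pi}}$ looks like concretely (made-up transition matrix, not from the paper): for a finite MDP it is the solution of a linear system, and its total mass over all states is $\sum_t \gamma^t = 1/(1-\gamma)$:

```python
import numpy as np

gamma = 0.9
# Made-up state transition matrix under tilde-pi: P[s, s'] (rows sum to 1)
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
s0 = 0  # fixed start state

# rho(s) = sum_t gamma^t P(s_t = s), i.e. rho = (I - gamma P^T)^{-1} e_{s0}
e0 = np.eye(2)[s0]
rho = np.linalg.solve(np.eye(2) - gamma * P.T, e0)

# The discounted weights over all states sum to 1/(1-gamma)
print(rho, rho.sum())  # rho.sum() == 1/(1-gamma) == 10.0
```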

This relation is difficult to optimize for $\tilde{\pi}$ directly, because $\rho_{\tilde{\pi}}(s)$ depends on the very policy we are trying to find. Hence, TRPO approximates $\rho_{\tilde{\pi}}(s)$ with $\rho_{\pi}(s)$:

$\Delta \eta \approx \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a) = \zeta \mbox{ (say)}$

Let's try to show that $\zeta$ is indeed a good approximation to $\Delta \eta$. To do so, first define $\bar{A}(s)$ as the expected advantage from state $s$ under $\tilde{\pi}$: $$\bar{A}(s) = \sum_a \tilde{\pi}(a|s) \cdot A_{\pi}(s, a) = \mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)} [A_{\pi}(s, a)] = \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})]$$ Using this definition, $$\Delta \eta = \sum_s \rho_{\tilde{\pi}}(s) \cdot \bar{A}(s) \mbox{ and } \zeta = \sum_s \rho_{\pi}(s) \cdot \bar{A}(s)$$ In both of these terms, we can reintroduce the dependency on $t$ by expanding the definition of $\rho_{\tilde{\pi}}(s), \rho_{\pi}(s)$. If we started from some $s_0$, then $\Delta \eta$ is computed by finding the discounted advantage by following $\tilde{\pi}$: $$\Delta \eta = \sum_s \rho_{\tilde{\pi}}(s) \cdot \bar{A}(s) = \mathbb{E}_{\tau(s_0, \tilde{\pi})} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg]$$ And similarly, $\zeta$ is computed by finding the discounted advantage by following $\pi$: $$\zeta = \mathbb{E}_{\tau(s_0, \pi)} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg]$$ If $\zeta$ is a good approximation to $\Delta \eta$, the quantity $|\Delta \eta - \zeta|$ should be small. By the triangle inequality, \begin{align*} |\Delta \eta - \zeta| &\leq \sum_{t=0}^\infty \gamma^t \cdot |\mathbb{E}_{\tau(s_0, \tilde{\pi})}[\bar{A}(s_t)] - \mathbb{E}_{\tau(s_0, \pi)}[\bar{A}(s_t)]| \end{align*}

To simplify this, we need an assumption. Suppose the stochastic policies $\pi$ and $\tilde{\pi}$ choose different actions at any given state $s$ with probability at most $\alpha$:

$P(a \neq \tilde{a} | s) \leq \alpha$

Let $n_t$ be the number of times $\pi$ and $\tilde{\pi}$ disagree before timestep $t$. Then,

$P(n_t > 0) = 1 - P(n_t=0) \leq 1 - (1-\alpha)^t$, since the two policies agree at each of the first $t$ steps with probability at least $1-\alpha$.
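As a quick sanity check (toy numbers, not from the paper), we can simulate the worst case where the policies disagree independently with probability exactly $\alpha$ at every step, in which case the bound holds with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, t, n_runs = 0.1, 5, 200_000

# Each row is one trajectory prefix; True marks a disagreement at that step.
# We assume (for this toy check) a constant disagreement probability alpha.
disagree = rng.random((n_runs, t)) < alpha
p_any = disagree.any(axis=1).mean()   # Monte Carlo estimate of P(n_t > 0)

bound = 1 - (1 - alpha)**t            # the analytic bound, ~0.4095 here
print(p_any, bound)
```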

Let's try to bound $|\bar{A}(s)|$: \begin{align*} \bar{A}(s) &= \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] \\ &= \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] - \mathbb{E}_{a} [A_{\pi}(s, a)] \quad (a \sim \pi(\cdot|s),\ \tilde{a} \sim \tilde{\pi}(\cdot|s)) \end{align*} We can subtract this term because it is zero: the reward depends only on the state, and the $V$ function equals the $Q$ function in expectation over actions: $$\mathbb{E}_{a} [A_{\pi}(s, a)] = \mathbb{E}_{a} [Q_{\pi}(s, a)]-V_{\pi}(s) = \mathbb{E}_{a} [r(s)+\gamma\cdot V_\pi(s')]-V_{\pi}(s) = 0$$ Continuing with the bound on $|\bar{A}(s)|$, and coupling $a$ and $\tilde{a}$ on a joint distribution (with marginals $\pi(\cdot|s)$ and $\tilde{\pi}(\cdot|s)$) under which they agree as often as possible, \begin{align*} \bar{A}(s) &= \mathbb{E}_{\tilde{a}} [A_{\pi}(s, \tilde{a})] - \mathbb{E}_{a} [A_{\pi}(s, a)] \\ &= \mathbb{E}_{a,\tilde{a}} [A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a)] \\ &= P(a \neq \tilde{a} | s) \cdot \mathbb{E}_{a,\tilde{a}} [A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a) \mid a \neq \tilde{a}, s] + P(a = \tilde{a} | s) \cdot 0 \\ |\bar{A}(s)| &\leq \alpha \cdot |\mathbb{E}_{a,\tilde{a}} [A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a) \mid a \neq \tilde{a}, s]| \\ &\leq \alpha \cdot 2 \max_{s,a} |A_\pi(s, a)| = 2\alpha\epsilon, \mbox{ where } \epsilon = \max_{s,a} |A_\pi(s, a)| \end{align*}

Then, in our original difference:

\begin{align*} |\Delta \eta - \zeta| &\leq \sum_{t=0}^\infty \gamma^t \cdot |\mathbb{E}_{\tau(s_0, \tilde{\pi})}[\bar{A}(s_t)] - \mathbb{E}_{\tau(s_0, \pi)}[\bar{A}(s_t)]| \end{align*}

Note that each expectation can be split into two cases: either the policies have not disagreed before time $t$, or they have. If there are no disagreements until time $t$, the two trajectory distributions coincide, so those terms cancel out. Otherwise, $n_t > 0$:

\begin{align*} |\Delta \eta - \zeta| &\leq \sum_{t=0}^\infty \gamma^t \cdot P(n_t > 0) \cdot |\mathbb{E}_{\tau(s_0, \tilde{\pi})}[\bar{A}(s_t) \mid n_t > 0] - \mathbb{E}_{\tau(s_0, \pi)}[\bar{A}(s_t) \mid n_t > 0]| \\ &\leq \sum_{t=0}^\infty \gamma^t \cdot [1- (1-\alpha)^t] \cdot (2\cdot 2\alpha \epsilon) \\ &= \sum_{t=0}^\infty 4\alpha\epsilon\cdot [\gamma^t-\gamma^t(1-\alpha)^t]\\ &= 4\alpha\epsilon \bigg\{ \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} \bigg\} = \frac{4\alpha\epsilon\cdot \alpha\gamma}{(1-\gamma)(1-\gamma+\alpha\gamma)} \\ &\leq \frac{4\alpha^2\epsilon\gamma}{(1-\gamma)^2} = c_0 \cdot\alpha^2 \mbox{ (say)} \end{align*}
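The geometric-series algebra in the last two steps can be sanity-checked numerically (made-up $\alpha$ and $\gamma$):

```python
# Check: sum_t gamma^t [1 - (1-alpha)^t]
#        = alpha*gamma / ((1-gamma) * (1-gamma+alpha*gamma))
alpha, gamma = 0.1, 0.9
lhs = sum(gamma**t * (1 - (1 - alpha)**t) for t in range(10_000))  # truncated
rhs = alpha * gamma / ((1 - gamma) * (1 - gamma + alpha * gamma))
print(lhs, rhs)  # both ~4.7368
```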
The main argument behind TRPO comes from noticing that $\alpha$, the probability that $\pi$ and $\tilde{\pi}$ disagree, can be related to other measures of disagreement between the two policies: $\alpha$ can be taken to be the maximum (over states) total variation divergence between $\pi(\cdot|s)$ and $\tilde{\pi}(\cdot|s)$, and the squared total variation divergence is in turn bounded by the KL divergence (Pinsker's inequality). The bound we just obtained tells us that: $$|\Delta\eta - \zeta| \leq c_0 \cdot \alpha^2 \mbox{, or } \Delta \eta \geq \zeta - c_0 \cdot \alpha^2 \geq \zeta - c_1 \cdot D_{KL}(\pi, \tilde{\pi})$$ Hence, if we maximize $\zeta - c_1 \cdot D_{KL}(\pi, \tilde{\pi})$ at every step, we are guaranteed monotonic improvement in $\eta$. Our optimization problem is therefore $$\mbox{maximize } \zeta - c_1 \cdot D_{KL}(\pi, \tilde{\pi})$$ In the practical algorithm, we instead maximize an empirical estimate of $\zeta$ subject to the constraint that $D_{KL}(\pi, \tilde{\pi})$ is within some constant $\delta$.

In practice, $\zeta$ is used in the following form, with an importance sampling estimate over a behaviour distribution $q(a|s)$:

\begin{align*} \zeta &= \mathbb{E}_{\tau(s_0, \pi)} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg] \\ &= \sum_s \rho_\pi(s) \sum_a \pi(a|s)\cdot A_\pi(s, a) \\ &= \sum_s \sum_a \rho_\pi(s) \cdot \pi(a|s)\cdot A_\pi(s, a) \\ &= \sum_s \sum_a \rho_\pi(s) \cdot q(a|s) \cdot \bigg\{ \frac{\pi(a|s)}{q(a|s)} \bigg\} \cdot A_\pi(s, a) \\ &\equiv \sum_s \sum_a \rho_\pi(s) \cdot q(a|s) \cdot \bigg\{ \frac{\pi(a|s)}{q(a|s)} \bigg\} \cdot Q_\pi(s, a) \\ &= \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \frac{\pi(a|s)}{q(a|s)} \cdot Q_\pi(s, a) \bigg] \end{align*}

Replacing $A_\pi(s, a)$ with $Q_\pi(s, a)$ only shifts the objective by a constant ($\sum_s \rho_\pi(s) \cdot V_\pi(s)$, which does not depend on the policy being optimized), so the maximizer is unchanged.
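The importance-sampling step in the middle is just a reweighting identity; a tiny numeric illustration (made-up distributions and values, for one fixed state $s$):

```python
import numpy as np

# Reweighting samples from q by pi/q recovers the expectation under pi.
pi = np.array([0.7, 0.2, 0.1])       # target action distribution pi(a|s)
q  = np.array([0.3, 0.3, 0.4])       # behaviour distribution q(a|s)
Q_vals = np.array([2.0, -1.0, 0.5])  # Q_pi(s, a) for the three actions

direct   = (pi * Q_vals).sum()             # E_{a~pi}[Q]
weighted = (q * (pi / q) * Q_vals).sum()   # E_{a~q}[(pi/q) * Q]
print(direct, weighted)  # identical up to floating point
```

In the actual algorithm the expectation over $q$ is replaced by an average over sampled actions, which is what makes the estimator practical.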

The paper lists two ways to form Monte Carlo approximations of the optimization objective and constraint:
1. Single Path
Collect a sample of start states, generate a trajectory from each by following the old policy until some time $T$, and use the old policy itself as the sampling distribution $q(a|s)$.

2. Vine
Generate a trajectory $\tau$ from some start state using the old policy, then choose a subset of states in $\tau$. For each chosen state $s$, sample $K$ actions from $q(a|s)$. With this data we can compute $\zeta$, and for each $(s, a)$ pair we can perform a rollout to compute an estimate of $Q_\pi(s, a)$.