# Trust Region Policy Optimization

Let $\rho_0$ be the distribution of start states: for any $s_0 \in \mathbf{S}$, $\rho_0(s_0)$ gives the probability of $s_0$ being a start state. Let's sample some $s_0 \sim \rho_0$. For the remainder of this post, $\tau(s_0, \pi)=(s_0, a_0, s_1, a_1, \ldots)$ denotes a trajectory of states and actions, beginning from state $s_0$ and following policy $\pi$.

Let $\eta(\pi)$ be the expected total discounted reward when starting from $s_0 \sim \rho_0$ and following a stochastic policy $\pi$:

\[ \eta(\pi) = \mathbb{E}_ {\tau(s_0, \pi)}[r(s_0)+\gamma\cdot r(s_1) + \gamma^2 \cdot r(s_2)+\cdots] \]
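For concreteness, the quantity inside the expectation is just a discounted sum of rewards along one trajectory; a minimal sketch (the rewards and $\gamma$ are made up):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """r(s_0) + gamma * r(s_1) + gamma^2 * r(s_2) + ... for one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# eta(pi) is the average of this quantity over many trajectories
# sampled by following pi from s_0 ~ rho_0.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```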

And, let's define the advantage function for any $s,a$ as:

\[ A_\pi(s, a) = Q_\pi(s, a)-V_\pi(s) \]
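A useful sanity check on this definition: since $V_\pi(s) = \sum_a \pi(a|s) \cdot Q_\pi(s, a)$, the advantage averages to zero under $\pi$'s own action distribution. A small tabular sketch with random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
Q = rng.normal(size=(n_states, n_actions))              # hypothetical Q_pi(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi(a|s), rows sum to 1

V = (pi * Q).sum(axis=1)   # V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
A = Q - V[:, None]         # A_pi(s, a) = Q_pi(s, a) - V_pi(s)

# The advantage averages to zero under the policy's own action distribution:
print((pi * A).sum(axis=1))  # ~ [0, 0, 0, 0]
```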

A classical result (Kakade and Langford, 2002) expresses the performance of a new policy $\tilde{\pi}$ in terms of the advantages of the current policy $\pi$, accumulated along trajectories of $\tilde{\pi}$. Writing $\Delta \eta = \eta(\tilde{\pi}) - \eta(\pi)$:

\[ \Delta \eta = \mathbb{E}_ {\tau(s_0, \tilde{\pi})}\bigg[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \bigg] \]

Since we are following the trajectory $\tau(s_0, \tilde{\pi})$, we encounter the term $A_\pi(s, a)$ whenever we visit state $s$, **and** sample action $a$ from the stochastic policy. So, at every timestep $t$ with $s_t = s$, the advantage term appears multiplied by $\gamma^t$. For each state, we can count its expected discounted weight by computing:

\[ \rho_{\tilde{\pi}}(s) = P(s_0=s)+\gamma\, P(s_1=s)+\gamma^2\, P(s_2=s) + \cdots \]

where the probabilities are taken over trajectories following $\tilde{\pi}$; this is the (unnormalized) discounted visitation frequency of $s$. To get the weight of a state-action pair, we multiply this quantity by the probability of taking $a$ in $s$, which is $\tilde{\pi}(a|s)$. Then, we can rewrite the result on $\Delta \eta$ without the dependence on $t$:

\[ \Delta \eta = \sum_ {s, a} \rho_{\tilde{\pi}}(s) \cdot \tilde{\pi}(a|s) \cdot A_\pi(s, a) = \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a) \]
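For a small MDP with a known transition matrix, the discounted visitation frequencies have a closed form via a geometric series; a sketch with made-up numbers:

```python
import numpy as np

gamma = 0.9
rho0 = np.array([1.0, 0.0])      # start-state distribution rho_0
P = np.array([[0.5, 0.5],        # P[s, s'] = P(s' | s) under the policy
              [0.2, 0.8]])

# rho(s) = sum_t gamma^t P(s_t = s) solves to rho = (I - gamma P^T)^{-1} rho0.
rho = np.linalg.solve(np.eye(2) - gamma * P.T, rho0)

# Sanity check: the total weight is the geometric series 1 / (1 - gamma).
print(rho.sum())  # ~ 10.0
```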

This relation is difficult to optimize for $\tilde{\pi}$ directly, because the visitation frequencies $\rho_{\tilde{\pi}}(s)$ themselves depend on the new policy $\tilde{\pi}$. Hence, TRPO approximates $\rho_{\tilde{\pi}}(s)$ with $\rho_{\pi}(s)$:

\[ \Delta \eta \approx \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a) = \zeta \mbox{ (say)} \]

To bound the error of this approximation, we need an assumption. Let's say that at any given state $s$, the stochastic policies $\pi$ and $\tilde{\pi}$ disagree (produce different actions) with probability at most $\alpha$:

\[ P(a \neq \tilde{a} | s) \leq \alpha \]

Let $n_t$ be the number of times $\pi$ and $\tilde{\pi}$ disagree before timestep $t$. Then,

\[ P(n_t > 0) = 1 - P(n_t=0) = 1 - [1-P(a\neq \tilde{a}|\cdot)]^t \leq 1- (1-\alpha)^t \]
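A quick numeric look at how fast this bound grows with $t$ (the values of $\alpha$ and $t$ are arbitrary):

```python
alpha = 0.1
for t in [1, 5, 20]:
    bound = 1 - (1 - alpha) ** t
    # The probability of at least one disagreement approaches 1 as t grows.
    print(t, round(bound, 4))
```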

Now consider the difference between the true improvement $\Delta \eta$ and the approximation $\zeta$. Each of the expectations can be split into two cases: either there are no disagreements before time $t$ ($n_t = 0$), or there is at least one ($n_t > 0$). If there are no disagreements, the trajectories coincide, so the corresponding terms cancel out. If there are disagreements, the contribution is controlled by $P(n_t > 0) \leq 1-(1-\alpha)^t$. Carrying this through (Schulman et al., 2015, Theorem 1) yields a lower bound of the form

\[ \Delta \eta \geq \zeta - \frac{4 \epsilon \gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \epsilon = \max_{s,a} |A_\pi(s, a)| \]

In the practical objective, $\zeta$ is used in the following form, with an **importance sampling** estimate over a sampling distribution $q(a|s)$:

\[ \begin{align*} \zeta &= \mathbb{E}_{\tau(s_0, \pi)} \bigg[ \sum_{t=0}^\infty \gamma^t \cdot \bar{A}(s_t) \bigg], \quad \text{where } \bar{A}(s) = \sum_a \tilde{\pi}(a|s) \cdot A_\pi(s, a) \\ &= \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a|s)\cdot A_\pi(s, a) \\ &= \sum_s \sum_a \rho_\pi(s) \cdot \tilde{\pi}(a|s)\cdot A_\pi(s, a) \\ &= \sum_s \sum_a \rho_\pi(s) \cdot q(a|s) \cdot \bigg\{ \frac{\tilde{\pi}(a|s)}{q(a|s)} \bigg\} \cdot A_\pi(s, a) \\ &\equiv \sum_s \sum_a \rho_\pi(s) \cdot q(a|s) \cdot \bigg\{ \frac{\tilde{\pi}(a|s)}{q(a|s)} \bigg\} \cdot Q_\pi(s, a) \\ &= \mathbb{E}_ {s \sim \rho_\pi, a \sim q} \bigg[ \frac{\tilde{\pi}(a|s)}{q(a|s)} \cdot Q_\pi(s, a) \bigg] \end{align*} \]
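The last line suggests a Monte Carlo estimator: sample states from $\rho_\pi$, actions from $q$, and reweight by $\tilde{\pi}/q$. A self-contained tabular sketch (all numbers are made up; $q$ is chosen to overlap $\tilde{\pi}$ so the importance weights stay bounded):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3

Q = rng.normal(size=(n_states, n_actions))                 # Q_pi(s, a) of the old policy
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)  # candidate policy pi~(a|s)
q = 0.5 * pi_new + 0.5 / n_actions                         # sampling distribution q(a|s)
rho = rng.dirichlet(np.ones(n_states))                     # stand-in for (normalized) rho_pi

# Exact surrogate: sum_s rho(s) sum_a pi~(a|s) Q(s, a)
exact = (rho[:, None] * pi_new * Q).sum()

# Importance-sampling estimate: s ~ rho, a ~ q(.|s), weight by pi~(a|s)/q(a|s).
n = 100_000
s = rng.choice(n_states, size=n, p=rho)
u = rng.random(n)
a = (u[:, None] > q[s].cumsum(axis=1)).sum(axis=1)  # inverse-CDF sampling of actions
a = np.minimum(a, n_actions - 1)
estimate = np.mean(pi_new[s, a] / q[s, a] * Q[s, a])

print(abs(exact - estimate))  # small, and shrinking as n grows
```

With $q = \pi$ (the old policy), this reduces to the single-path sampling scheme described in the TRPO paper.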

The shift from $A_\pi(s, a)$ to $Q_\pi(s, a)$ changes the objective only by $\sum_s \rho_\pi(s) \cdot V_\pi(s)$, which does not depend on $\tilde{\pi}$, so the maximizer is unchanged.

## References

- Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.
- Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML. Vol. 2. 2002.