RUDDER: Return Decomposition for Delayed Rewards

In reinforcement learning, we often want to do credit assignment, that is, assign credit for a received reward to past actions. Formally, if we have a trajectory $S_1,A_1,S_2,A_2,\cdots,S_T$ and a terminal reward $R_T$, we want to redistribute this sparse, delayed, non-Markov reward to earlier timesteps in the trajectory so that the reward becomes immediate and non-sparse. This paper \cite{arjona2018rudder} deals with doing exactly that, i.e., identifying which actions in the trajectory were instrumental in achieving the reward outcome.

Sequence Markov Decision Processes (SDPs)

A finite MDP $M$ is a 5-tuple $M=(\mathcal S, \mathcal A, \mathcal R, p, \gamma)$ where $\mathcal S$ is the state space, $\mathcal A$ is the action space, $\mathcal R$ is the reward function, $p(S_{t+1}, R_t | S_t, A_t)$ is the transition reward distribution, and $\gamma \in [0, 1]$ is the discount factor. Markov decision processes usually come equipped with Markov policies $\pi(A_t|S_t)$.

A sequence MDP or SDP is a decision process with a Markov transition reward distribution and a Markov policy, but with a reward that is not required to be Markov. That means the reward could depend on the history of states ($\mathcal R(S_{1:t})$) rather than just the immediate state ($\mathcal R(S_t)$).

Note that all MDPs are SDPs by definition.

Return Equivalent SDPs

Now we define return equivalence, which will come in handy later. Basically, in reward redistribution we want to convert our original SDP $M$ with delayed reward to an SDP $M'$ with immediate rewards. So, we define return equivalence as follows: two SDPs are return equivalent if they may differ in their reward distributions but, given a policy $\pi$, they have the same expected return at $t=1$: $(v_1^\pi)_ M = (v _1^\pi)_{M'}$.

It is easy to see that return equivalent SDPs conserve optimal policies. An optimal policy is defined as a policy that maximizes the expected return at $t=1$. Since the expected return at $t=1$ is the same across two return equivalent SDPs, they conserve optimal policies too.

$$ \mathbb E_\pi[(G_1)_M|s_1] = \mathbb E_\pi[(G_1)_{M'}|s_1] $$
That is, while converting from one return equivalent SDP to another, the optimal policy stays the same. This means that we can convert our original SDP with a delayed reward to an SDP with immediate rewards while our optimal policy stays the same.

Strictly Return Equivalent SDPs

Based on the concept of return equivalent SDPs, we can define strictly return equivalent SDPs. Return equivalent SDPs which have the same expected return for every episode and every policy are called strictly return equivalent SDPs. That is,

$$ \mathbb E_\pi[(G_1)_{M}|s_1,a_1,\cdots,s_T,a_T] = \mathbb E_\pi[(G_1)_{M'}|s_1,a_1,\cdots,s_T,a_T] $$

Reward Redistribution

Given all these formalisms, let’s recall the objective. Our goal is to do reward redistribution. More specifically, for a trajectory $s_1,a_1,s_2,a_2,\cdots,s_T, a_T$ we wish to redistribute the return $G_1$ (or the expected return) along the trajectory. Since such a redistribution conserves expected return at time $t=1$, we are essentially converting our SDP with the given reward into another SDP with a different reward distribution.

If we do reward redistribution for every trajectory, we are converting our SDP to a strictly return equivalent SDP.
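As a sanity check, here is a minimal sketch of this idea, using a made-up terminal return and a naive uniform redistribution (not the paper's optimal one) to show that any redistribution that sums to the original return leaves $G_1$ unchanged (assuming $\gamma = 1$):

```python
# Sketch: redistributing a delayed terminal return along a trajectory
# keeps the return G_1 at t=1 unchanged, so the new SDP is strictly
# return equivalent to the original one (assuming gamma = 1).

def redistribute_uniform(terminal_return, T):
    """Naive redistribution: spread the delayed return evenly over T steps."""
    return [terminal_return / T] * T

T = 5
G1_old = 10.0                      # delayed reward: only R_T is nonzero
new_rewards = redistribute_uniform(G1_old, T)

G1_new = sum(new_rewards)          # return at t=1 in the new SDP
print(G1_new)  # 10.0 -- same return, now delivered immediately
```

Uniform spreading conserves the return but is not the redistribution the paper advocates; the next section derives what an *optimal* redistribution looks like.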

Optimal Reward Redistribution

How should we do our reward redistribution? This is the main idea as expressed in the paper. Given an SDP $M$ with a non-Markov delayed reward, we wish to obtain a strictly return equivalent SDP $M'$ with immediate rewards. In this new SDP, we want to ensure that at every timestep, the expected future return is $0$. More technically, recall that $$ Q^\pi (s_t,a_t) = r(s_t,a_t) + \mathbb E_\pi[G_{t+1}|s_t,a_t] $$ If we enforce $\mathbb E_\pi[G_{t+1}|s_t,a_t] = 0$, then all the reward for $(s_t,a_t)$ is given immediately. The task of estimating the $Q$ value is now equivalent to estimating $\mathbb E_\pi[R_t|s_t,a_t]$.

Considering a second-order Markov SDP, the paper proves that

$$ \mathbb E_\pi[(R_t)_{M'}|s_{t-1},a_{t-1},s_t,a_t] = (Q^\pi(s_t,a_t)-Q^\pi(s_{t-1},a_{t-1}))_M $$
That is, the expected immediate reward in the new SDP equals the difference of consecutive $Q$ values in the old SDP.
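A minimal numeric sketch of this telescoping, using a hypothetical table of $Q$ values along one trajectory (the values are made up for illustration; with $Q$ at "$t=0$" taken as $0$, the redistributed rewards sum back to the delayed return):

```python
# Sketch of the Q-difference redistribution on hypothetical Q values.
Q = [0.0, 0.25, 0.25, 0.75, 1.0]   # Q^pi(s_t, a_t) along a trajectory, t = 1..T

# Expected immediate reward in M': difference of consecutive Q values,
# with the Q value before the first step taken as 0.
redistributed = [Q[0]] + [Q[t] - Q[t - 1] for t in range(1, len(Q))]

print(redistributed)        # [0.0, 0.25, 0.0, 0.5, 0.25]
print(sum(redistributed))   # 1.0 == Q^pi(s_T, a_T), the delayed return
```

Note how the two steps where $Q$ jumps (t = 2 and t = 4 onward) receive the credit, while steps that do not change the expected return receive zero reward.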

Novel Learning Algorithms

This difference in $Q$ values is a good way to estimate immediate rewards and convert to the new SDP. In the new SDP, since the reward redistribution is optimal, the expected future return is zero, so the $Q$ value reduces to the expected immediate reward and becomes straightforward to estimate. Therefore, we can perform on-policy direct $Q$ estimation using:

$$ (Q^\pi(s_t,a_t))_{M'} = (Q^\pi(s_t,a_t))_M - \mathbb E_{s_{t-1}, a_{t-1}|s_t}[(Q^\pi(s_{t-1}, a_{t-1}))_M] $$
And we can do off-policy $Q$ estimation with a behaviour policy $\pi'$ as follows:
$$ (Q^\pi(s_t,a_t))_{M'} = (Q^\pi(s_t,a_t))_M - \mathbb E_{s_{t-1}, a_{t-1},\pi'|s_t}[(Q^\pi(s_{t-1}, a_{t-1}))_M] $$
Another approach is to do policy gradients, where $Q^\pi$ can be computed trivially.

Explaining Away Problem

In doing all of this, the explaining away problem may arise. That is, the true cause of a reward may not receive credit, because a later state can also explain away the reward outcome.

For example, in a door-key environment where the objective is to pick up a key, say the pick-up happens at timestep $t_0$. However, the information of having picked up the key is also present in every state from $t_0$ until the last step. In such a case, reward redistribution can credit any state-action pair after $t_0$ with the reward: "we see the key in hand in one of those states, hence the reward" — which is incorrect reasoning.

To solve this problem, the authors propose taking a difference of states to capture only the changes in the environment. Specifically, given a difference function $\Delta$, we compute $\Delta_{1:T}$, which are assumed to be mutually independent: $$ \Delta_{1:T} = (\Delta(s_{0},a_{0},s_1,a_1),\cdots,\Delta(s_{T-1},a_{T-1},s_T,a_T)) $$
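A small sketch of why differences help, using toy dictionary states (not from the paper) for the door-key example. The persistent `has_key` flag appears in every state after the pick-up, but in the differences it appears only at the pick-up step:

```python
# Sketch of a difference function for the door-key example. The state
# carries a persistent has_key flag; the difference keeps only what changed,
# so the key pick-up shows up exactly once. (Toy states, made up here.)

states = [
    {"pos": 0, "has_key": False},
    {"pos": 1, "has_key": False},
    {"pos": 1, "has_key": True},   # key picked up at this step
    {"pos": 2, "has_key": True},   # flag persists -> would explain away
]

def delta(s_prev, s_next):
    """Keep only the entries that changed between consecutive states."""
    return {k: s_next[k] for k in s_next if s_next[k] != s_prev[k]}

diffs = [delta(states[t - 1], states[t]) for t in range(1, len(states))]
print(diffs)
# [{'pos': 1}, {'has_key': True}, {'pos': 2}]
# 'has_key' now appears only at the pick-up step, not in every later state.
```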

Differences in Q values

An LSTM function $g$ predicts the return at every timestep. The contributions $h$ are calculated as the differences of consecutive return predictions: $$ h(\Delta_{1:t}) = g(\Delta_{1:t})-g(\Delta_{1:t-1}) $$ with $g(\Delta_{1:0}) = 0$. By telescoping, all the contributions sum up to the final delayed return in the old SDP: $$ g(\Delta_{1:T}) = \sum_t h(\Delta_{1:t}) = \mathbb E[(R_T)_M|s_T,a_T] $$ Intuitively, for a binary delayed reward, the immediate rewards in the new SDP represent the change in success probability, which corresponds to the difference in $Q$ values in the old SDP.
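A minimal sketch of this contribution analysis, with hypothetical per-timestep return predictions standing in for a trained LSTM's outputs:

```python
# Sketch: contributions from per-timestep return predictions g(Delta_{1:t}).
# The g values below are made-up stand-ins for real LSTM outputs.

g = [0.0, 0.125, 0.5, 0.5, 1.0]   # g(Delta_{1:t}) for t = 1..T

# h(Delta_{1:t}) = g(Delta_{1:t}) - g(Delta_{1:t-1}), with g(Delta_{1:0}) = 0
h = [g[0]] + [g[t] - g[t - 1] for t in range(1, len(g))]

# The contributions telescope back to the final return prediction:
print(sum(h) == g[-1])  # True
```

The contributions $h$ then serve as the redistributed immediate rewards in the new SDP.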

Integrated Gradients

The second method the paper uses is integrated gradients (IG) \cite{sundararajan2017axiomatic}. IG is an explainable AI technique used to explain a model’s prediction in terms of its input features. So in our case, we can use it to explain the predicted returns by constructing contribution values.

Our input to the model is $x=\Delta_{1:T}$. We interpolate between a baseline $\tilde x = 0$ and our input $x = \Delta_{1:T}$, and compute gradients of the model prediction $F$ w.r.t. the inputs along this path. Then, we use the formula $$ \mathrm{IG}_i(x) := (x_i-\tilde x_i) \times \frac{1}{m}\sum_{k=1}^m \frac{\partial F\!\left(\tilde x + \frac{k}{m}(x-\tilde x)\right)}{\partial x_i} $$ For more information about IG, consider going through the TensorFlow tutorials.
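Here is a minimal numpy sketch of the formula above on a toy differentiable model; $F$, its gradient, the baseline, and the input are all made up for illustration. The check at the end is IG's completeness axiom: the attributions sum (approximately) to $F(x) - F(\tilde x)$:

```python
import numpy as np

def F(x):
    return (x ** 2).sum()          # toy stand-in for the return predictor

def grad_F(x):
    return 2 * x                   # analytic gradient of the toy F

def integrated_gradients(x, baseline, m=100):
    """Riemann approximation of the IG path integral with m steps."""
    ig = np.zeros_like(x)
    for k in range(1, m + 1):
        interpolated = baseline + (k / m) * (x - baseline)
        ig += grad_F(interpolated)
    return (x - baseline) * ig / m

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(x, baseline)

# Completeness: attributions sum (approximately) to F(x) - F(baseline).
print(np.isclose(ig.sum(), F(x) - F(baseline), atol=0.2))  # True
```

In RUDDER's setting, the per-component attributions $\mathrm{IG}_i$ play the role of the contributions assigned to each timestep's $\Delta$.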

Layerwise Relevance Propagation

The third and final method the paper focuses on is called Layerwise Relevance Propagation (LRP) \cite{bach2015pixel}. LRP recursively propagates back relevance values, much like backpropagation. It starts at the output and redistributes the relevance values of every layer to its lower layer. The redistribution satisfies a relevance conservation principle, in that every node distributes all the relevance it receives from higher layers to lower layers.
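A small sketch of the conservation principle using LRP's basic $z$-rule on a tiny two-layer ReLU network; the weights and inputs are random made-up values, chosen non-negative so the denominators stay positive:

```python
import numpy as np

def lrp_linear(a, W, R_out, eps=1e-9):
    """z-rule: redistribute relevance R_out of a layer's outputs to its inputs a."""
    z = a[:, None] * W                 # contribution of input i to output j
    denom = z.sum(axis=0) + eps        # total pre-activation per output
    return (z / denom * R_out).sum(axis=1)

rng = np.random.default_rng(0)
a0 = rng.random(4)                     # non-negative input features
W1 = rng.random((4, 3))                # layer 1 weights (non-negative, made up)
W2 = rng.random((3, 1))                # layer 2 weights

a1 = np.maximum(a0 @ W1, 0.0)          # hidden activations (ReLU)
out = float(a1 @ W2)                   # network output = total relevance

R1 = lrp_linear(a1, W2, np.array([out]))   # output layer -> hidden layer
R0 = lrp_linear(a0, W1, R1)                # hidden layer -> input layer

print(np.isclose(R0.sum(), out))       # True: relevance is conserved
```

Each layer passes on exactly the relevance it received, so the input relevances sum back to the network's output; in RUDDER this yields per-timestep contributions analogous to the prediction differences and integrated gradients above.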